1 Introduction

Semiparametric models with a large number of predictors arise frequently in contemporary statistical studies. Large data sets and high dimensionality characterize many contemporary scientific endeavors ([6]; [8]). Statistical models with many predictors are frequently employed to enhance explanatory and predictive power. At the same time, semiparametric modeling is frequently incorporated to balance modeling bias against the "curse of dimensionality". Profile likelihood techniques ([23]) are frequently applied to this kind of semiparametric model. When the number of predictors is large, it is more realistic to regard it as growing with the sample size. Yet few results are available for semiparametric profile inference when the number of parameters diverges with the sample size. This paper focuses on profile likelihood inference with a diverging number of parameters in the context of the generalized varying coefficient partially linear model (GVCPLM).

GVCPLM is an extension of the generalized linear model ([20]) and the generalized varying-coefficient model ([12]; [I]). It allows some coefficient functions to vary with certain covariates U, such as age ([9]), toxic exposure level, or the time variable in longitudinal data or survival analysis ([22]). Therefore, general interactions between the variable U and these covariates, not just the linear interactions of parametric models, are explored nonparametrically.

If $Y$ is a response variable and $(U, X, Z)$ are the associated covariates, then by letting $\mu(u,x,z) = E\{Y \mid (U,X,Z) = (u,x,z)\}$, the GVCPLM takes the form

$$ g\{\mu(u,x,z)\} = x^T \alpha(u) + z^T \beta, \tag{1.1} $$

where $g(\cdot)$ is a known link function, $\beta$ a vector of unknown regression coefficients and $\alpha(\cdot)$ a vector of unknown regression functions. One of the advantages over the varying coefficient model is that GVCPLM allows more efficient estimation when some coefficient functions are not really varying with $U$, after adjustment for other genuinely varying effects. It also allows a more interpretable model, where primary interest is focused on the parametric component.

1.1 A motivating example 

We use a real data example to demonstrate the need for GVCPLM. The Fifth National Bank of Springfield faced a gender discrimination suit in which female employees received substantially smaller salaries than male employees. This example is based on a real case with data dated 1995. Only the bank's name is changed. See Example 11.3 of [2]. For each of 208 employees, eight variables are collected: the employee's salary; age; year hired; number of years of working experience at another bank; gender; PC Job, a dummy variable with value 1 if the employee's job is computer related; educational level, a categorical variable with categories 1 (finished school), 2 (finished some college courses), 3 (obtained a bachelor's degree), 4 (took some graduate courses), 5 (obtained a graduate degree); and job grade, a categorical variable indicating the current job level, the possible levels being 1-6 (6 the highest).

Table 1: Proportions of employees having job grade higher than 4

Covariate TotalYrsExp
0-7 8-16 >17
Age < 35 1/11 1/9
Age > 35 2/11 8/21

[9] conducted such a salary analysis using an additive model with quadratic splines and did not find significant evidence of a gender difference. However, salary is directly related to job grade. After adjusting for job grade, salary discrimination cannot easily be seen. An important question then arises: do female employees have a lower probability of getting promoted? In analyzing such a probability, a common tool is logistic regression, a member of the class of generalized linear models (for example, see [20]).

To this end, we create a binary response variable HighGrade4, indicating whether Job Grade is greater than 4. The associated covariates are Female (1 for female employee and 0 otherwise), Age, TotalYrsExp (total years of working experience), PC Job, and Edu (level of education). If the covariate Female has a significantly negative coefficient, then it would suggest that female employees are harder to promote to higher grade jobs.

However, in a simple logistic regression, the effect of a covariate cannot change with another covariate nonparametrically. Table 1 shows the proportion of employees having a job grade higher than 4, categorized by Age and TotalYrsExp. Clearly, interactions between Age and TotalYrsExp have to be considered.

This can be done by creating categorical variables over the covariate Age. However, this would increase the number of predictors considerably if we create many categories of Age. More importantly, we do not know where to draw the borders of each Age category or how many categories should be produced. This problem is nicely overcome if we allow the coefficient of TotalYrsExp to vary with Age, so that we obtain a coefficient function of Age for TotalYrsExp. See Section 4.3 for a detailed analysis of the data.

If interactions between different variables are considered, then the number of predictors will be large compared with the sample size n = 208. This motivates us to consider the setting $p_n \to \infty$ as $n \to \infty$ and to present general theories in Section 2; such a setting will be faced by many modern statistical applications.

1.2 Goals of the paper 

When the number of parameters in $\beta$ is fixed and the link $g$ is the identity, the model (1.1) has been considered by [33], [H] and [31], and pQ. [7] propose a profile-kernel inference for such a varying coefficient partially linear model (VCPLM) and [18] considered a backfitting-based procedure for model selection in VCPLM. All of these papers rely critically on the explicit form of the estimation procedures, and the techniques cannot easily be applied to the GVCPLM.

Modern statistical applications often involve estimation of a large number of parameters. It is of interest to derive asymptotic properties for the profile likelihood estimators under model (1.1) when the number of parameters diverges. Fundamental questions arise naturally: whether the profile likelihood estimator ([23]) still possesses efficient sampling properties; whether the profile likelihood ratio test for the parametric component exhibits a Wilks type of phenomenon, namely whether the asymptotic null distributions are independent of nuisance functions and parameters; and whether the usual sandwich formula provides a consistent estimator of the covariance matrix of the profile likelihood estimator. These questions are poorly understood and will be thoroughly investigated in Section 2. Pioneering work on statistical inference with a diverging number of parameters includes [2], which gave related results on M-estimators, and [25], which analyzed a regular exponential family under the same setting. [5] studied the penalized likelihood approach under such a setting, whereas [10] investigated a semiparametric model with a growing number of nuisance parameters.

Another goal of this paper is to provide an efficient algorithm for computing profile likelihood estimates under the model (1.1). To this end, we propose a new algorithm, called the accelerated profile-kernel algorithm, based on an important modification of the Newton-Raphson iterations. The computational difficulties ([19]) of the profile-kernel approach are significantly reduced, while the nice sampling properties of such an approach over the backfitting algorithm (e.g. [13]) are retained. This will be demonstrated in Section 4, where the Poisson and logistic specifications are considered for simulations. A new difference-based estimate for the parametric component is proposed as an initial estimate for our profile-kernel procedure. Our method expands significantly the idea used in [32] and [7] for the partially linear model.

The outline of the paper is as follows. In Section 2 we briefly introduce profile likelihood estimation with local polynomial modeling and present our main asymptotic results. Section 3 turns to the computational aspect, discussing the elements of computing in the accelerated profile-kernel algorithm. Simulation studies and an analysis of a real data set are given in Section 4. The proofs of our results are given in Section 5 and technical details in the appendix.

2 Properties of profile likelihood inference 

Let $(Y_{ni}, X_i, Z_{ni}, U_i)$, $1 \le i \le n$, be a random sample where $Y_{ni}$ is a scalar response variable, and $U_i$, $X_i \in \mathbb{R}^q$ and $Z_{ni} \in \mathbb{R}^{p_n}$ are vectors of explanatory variables. We consider model (1.1) with $\beta_n$ and $Z_n$ having dimension $p_n \to \infty$ as $n \to \infty$. As for distributions in the exponential family, we assume that the conditional variance depends on the conditional mean, so that $\mathrm{Var}(Y \mid U, X, Z_n) = V(\mu(U, X, Z_n))$ for a given function $V$ (our result is applicable even when $V$ is multiplied by an unknown scale). Then the conditional quasi-likelihood function is given by

$$ Q(\mu, y) = \int_{\mu}^{y} \frac{s - y}{V(s)}\, ds. $$

As in [28], we denote by $\alpha_{\beta_n}(u)$ the 'least favorable curve' of the nonparametric function $\alpha(u)$, which is defined as the one that maximizes

$$ E_0\{Q(g^{-1}(\eta^T X + \beta_n^T Z_n), Y_n) \mid U = u\} \tag{2.1} $$

with respect to $\eta$, where $E_0$ is the expectation taken under the true parameters $\alpha_0(u)$ and $\beta_{n0}$. As will be discussed in Section 2.1, through the use of the least favorable curve, no undersmoothing of the nonparametric component is required to achieve asymptotic normality when $p_n$ is diverging with $n$. Note that $\alpha_{\beta_{n0}}(u) = \alpha_0(u)$. Under some mild conditions, it satisfies

$$ \frac{\partial}{\partial \eta} E_0\{Q(g^{-1}(\eta^T X + \beta_n^T Z_n), Y_n) \mid U = u\}\Big|_{\eta = \alpha_{\beta_n}(u)} = 0. \tag{2.2} $$

The profile-likelihood function for $\beta_n$ is then

$$ Q_n(\beta_n) = \sum_{i=1}^n Q\{g^{-1}(\alpha_{\beta_n}(U_i)^T X_i + \beta_n^T Z_{ni}), Y_{ni}\}, \tag{2.3} $$

if the least-favorable curve $\alpha_{\beta_n}(\cdot)$ is known.

The least-favorable curve defined by (2.1) can be estimated by its sample version through a local polynomial regression approximation. For $U$ in a neighborhood of $u$, approximate the $j$th component of $\alpha_{\beta_n}(\cdot)$ as

$$ \alpha_j(U) \approx \alpha_j(u) + \frac{\partial \alpha_j(u)}{\partial u}(U - u) + \cdots + \frac{\partial^p \alpha_j(u)}{\partial u^p}\frac{(U - u)^p}{p!} \equiv a_{0j} + a_{1j}(U - u) + \cdots + a_{pj}(U - u)^p/p!. $$

Denoting $a_r = (a_{r1}, \ldots, a_{rq})^T$ for $r = 0, \ldots, p$, for each given $\beta_n$, we then maximize the local likelihood

$$ \sum_{i=1}^n Q\Big\{g^{-1}\Big(\sum_{r=0}^p a_r^T X_i (U_i - u)^r / r! + \beta_n^T Z_{ni}\Big), Y_{ni}\Big\} K_h(U_i - u) \tag{2.4} $$

with respect to $a_0, \ldots, a_p$, where $K(\cdot)$ is a kernel function and $K_h(t) = K(t/h)/h$ is a re-scaling of $K$ with bandwidth $h$. Thus, we get the estimate $\hat\alpha_{\beta_n}(u) = \hat a_0(u)$.
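
The local fit in (2.4) can be illustrated with a minimal sketch, assuming the identity link and square loss so that the maximization reduces to kernel-weighted least squares, and taking p = 1 (local linear). The function names `epanechnikov` and `local_linear_alpha` are illustrative only and do not come from the paper.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 * (1 - t^2)_+ ."""
    return 0.75 * np.clip(1.0 - t ** 2, 0.0, None)

def local_linear_alpha(u, U, X, Y, Z, beta, h):
    """Local linear estimate of alpha_beta(u) under square loss (assumed case).

    For a given beta, regress the partial residual Y - Z beta on
    [X, X * (U - u)] with weights K_h(U - u); the first q fitted
    coefficients play the role of a_0(u).
    """
    q = X.shape[1]
    w = epanechnikov((U - u) / h) / h            # K_h(U_i - u)
    D = np.hstack([X, X * (U - u)[:, None]])     # design for (a_0, a_1)
    R = Y - Z @ beta                             # partial residual
    WD = D * w[:, None]
    coef = np.linalg.solve(D.T @ WD, WD.T @ R)   # weighted least squares
    return coef[:q]
```

For a general link the same weights would enter a local quasi-likelihood that has to be maximized iteratively; the closed form above holds only in the assumed square-loss case.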

Plugging our estimates into the profile-kernel likelihood function (2.3), we have

$$ \hat Q_n(\beta_n) = \sum_{i=1}^n Q\{g^{-1}(\hat\alpha_{\beta_n}(U_i)^T X_i + \beta_n^T Z_{ni}), Y_{ni}\}. \tag{2.5} $$

Maximizing $\hat Q_n(\beta_n)$ with respect to $\beta_n$ gives $\hat\beta_n$. The varying coefficient functions are then estimated as $\hat\alpha_{\hat\beta_n}(u)$.

One property of the profile quasi-likelihood is that the first and second order Bartlett identities continue to hold. In particular, with $Q_n$ as defined by (2.3), for any $\beta_n$ we have

$$ E_{\beta_n}\Big(\frac{\partial Q_n}{\partial \beta_n}\Big) = 0, \qquad E_{\beta_n}\Big(\frac{\partial Q_n}{\partial \beta_n}\frac{\partial Q_n}{\partial \beta_n^T}\Big) = -E_{\beta_n}\Big(\frac{\partial^2 Q_n}{\partial \beta_n \partial \beta_n^T}\Big). $$

See [28] for more details. These properties give rise to the asymptotic efficiency of the profile likelihood estimator.

2.1 Consistency and asymptotic normality of $\hat\beta_n$

We need Regularity Conditions (A) - (G) in Section 5 for the following results.

Theorem 1 (Existence of profile likelihood estimator). Assume that Conditions (A)-(G) are satisfied. If $p_n^4/n \to 0$ as $n \to \infty$ and $h = O(n^{-a})$ with $(4(p+1))^{-1} < a < 1/2$, then there is a local maximizer $\hat\beta_n \in \Omega_n$ of $\hat Q_n(\beta_n)$ such that $\|\hat\beta_n - \beta_{n0}\| = O_P(\sqrt{p_n/n})$.

The above rate is the same as the one established by [H] for the M-estimator.

Note that the optimal bandwidth $h = O(n^{-1/(2p+3)})$ is included in Theorem 1. Hence $\sqrt{n/p_n}$-consistency is achieved without the need for undersmoothing of the nonparametric component. In particular, when $p_n$ is fixed, the result is in line with those obtained, for instance, by [27] in a different context.

Define $I_n(\beta_n) = n^{-1} E_{\beta_n}\big(\frac{\partial Q_n}{\partial\beta_n}\frac{\partial Q_n}{\partial\beta_n^T}\big)$, which is an extension of the Fisher information matrix. Since the dimensionality grows with the sample size, we need to consider arbitrary linear combinations of the profile-kernel estimator $\hat\beta_n$, as stated in the following theorem.

Theorem 2 (Asymptotic normality). Under Conditions (A) - (G), if $p_n^5/n = o(1)$ and $h = O(n^{-a})$ for $3/(10(p+1)) < a < 2/5$, then the consistent estimator $\hat\beta_n$ in Theorem 1 satisfies

$$ \sqrt{n}\, A_n I_n^{1/2}(\beta_{n0})(\hat\beta_n - \beta_{n0}) \xrightarrow{D} N(0, G), $$

where $A_n$ is an $l \times p_n$ matrix such that $A_n A_n^T \to G$, and $G$ is an $l \times l$ nonnegative symmetric matrix.

A notable technical feature of our result is that it does not require undersmoothing of the nonparametric component, as in Theorem 1, thanks to the profile likelihood approach. The key lies in a special orthogonality property of the least favorable curve (see equation (2.2) and Lemma 2). Asymptotic normality without undersmoothing is also proved in [30] for both backfitting and profiling methods.

Theorem 2 shows that profile likelihood produces a semiparametric efficient estimate even when the number of parameters diverges. To see this more explicitly, let $p_n = r$ be a constant. Then, by taking $A_n = I_r$, we obtain

$$ \sqrt{n}\, I_n^{1/2}(\beta_{n0})(\hat\beta_n - \beta_{n0}) \xrightarrow{D} N(0, I_r). $$

The asymptotic variance of $\hat\beta_n$ achieves the efficient lower bound given, for example, in






2.2 Profile likelihood ratio test 

After estimation of parameters, it is of interest to test the statistical significance of certain variables in the parametric component. Consider the problem of testing linear hypotheses:

$$ H_0 : A_n \beta_{n0} = 0 \quad \longleftrightarrow \quad H_1 : A_n \beta_{n0} \neq 0, $$

where $A_n$ is an $l \times p_n$ matrix and $A_n A_n^T = I_l$ for a fixed $l$. Note that both the null and the alternative hypotheses are semiparametric, with nuisance functions $\alpha(\cdot)$. The generalized likelihood ratio test (GLRT) statistic is defined by

$$ T_n = 2\Big\{\sup_{\Omega_n} \hat Q_n(\beta_n) - \sup_{\Omega_n;\, A_n\beta_n = 0} \hat Q_n(\beta_n)\Big\}. $$

Note that the testing procedure does not depend explicitly on the estimated asymptotic covariance matrix. The following theorem shows that, even when the number of parameters diverges with the sample size, $T_n$ still follows a chi-square distribution asymptotically, without reference to any nuisance parameters and functions. This reveals the Wilks phenomenon, as termed in [TT].

Theorem 3 Assuming Conditions (A) - (G), under $H_0$, we have

$$ T_n \xrightarrow{D} \chi^2_l, $$

provided that $p_n^5/n = o(1)$ and $h = O(n^{-a})$ for $3/(10(p+1)) < a < 2/5$.

2.3 Consistency of the sandwich covariance formula 

The estimated covariance matrix for $\hat\beta_n$ can be obtained by the sandwich formula

$$ \hat\Sigma_n = n^2 \{\nabla^2 \hat Q_n(\hat\beta_n)\}^{-1}\, \widehat{\mathrm{cov}}\{\nabla \hat Q_n(\hat\beta_n)\}\, \{\nabla^2 \hat Q_n(\hat\beta_n)\}^{-1}, $$

where the middle matrix has $(j, k)$ entry given by

$$ \frac{1}{n}\sum_{i=1}^n \frac{\partial \hat Q_{ni}(\hat\beta_n)}{\partial \beta_{nj}} \frac{\partial \hat Q_{ni}(\hat\beta_n)}{\partial \beta_{nk}}, $$

with $\hat Q_{ni}$ the $i$th summand of $\hat Q_n$. With the notation $\Sigma_n = I_n^{-1}(\beta_{n0})$, we have the following consistency result for the sandwich formula.

Theorem 4 Assume Conditions (A) - (G). If $p_n^4/n = o(1)$ and $h = O(n^{-a})$ with $(4(p+1))^{-1} < a < 1/2$, we have

$$ A_n \hat\Sigma_n A_n^T - A_n \Sigma_n A_n^T \xrightarrow{P} 0 \quad \text{as } n \to \infty $$

for any $l \times p_n$ matrix $A_n$ such that $A_n A_n^T = G$.

This result provides a simple way to construct confidence intervals for $\beta_n$. Simulation results show that this formula indeed provides a good estimate of the covariance of $\hat\beta_n$ for a variety of practical sample sizes.
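
As a rough sketch of how a sandwich formula of this shape might be evaluated numerically, assume that the per-observation score vectors and the Hessian of the fitted criterion are already available, and that the middle matrix is the average outer product of the scores. The function name `sandwich_covariance` and its argument layout are hypothetical.

```python
import numpy as np

def sandwich_covariance(scores, hessian):
    """Sketch of n^2 * H^{-1} * C * H^{-1}.

    scores  : (n, p) array; row i is the gradient of the i-th summand
              of the criterion at the estimate (assumed input).
    hessian : (p, p) array; Hessian of the full criterion at the estimate.
    C is taken here as the average outer product of the score rows,
    which is an assumption about the middle matrix.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    middle = scores.T @ scores / n
    h_inv = np.linalg.inv(np.asarray(hessian, dtype=float))
    return n ** 2 * h_inv @ middle @ h_inv
```

Because the Hessian of a sum over n observations grows like n, the two inverse Hessians contribute a factor of order 1/n^2, which the leading n^2 cancels.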

3 Computation of the estimates 

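Finding $\hat\beta_n$ to maximize the profile likelihood (2.5) poses some interesting challenges, as the function $\hat\alpha_{\beta_n}(u)$ in (2.5) depends on $\beta_n$ implicitly (except in the least-squares case). The full profile-kernel estimate directly employs the Newton-Raphson iterations

$$ \beta_n^{(k+1)} = \beta_n^{(k)} - \{\nabla^2 \hat Q_n(\beta_n^{(k)})\}^{-1} \nabla \hat Q_n(\beta_n^{(k)}), \tag{3.1} $$

starting from the initial estimate $\beta_n^{(0)}$. We will call the estimates $\beta_n^{(k)}$ and $\hat\alpha_{\beta_n^{(k)}}(u)$ the $k$-step estimates ([3]; [26]). The initial estimate for $\beta_n$ is critically important for the computational speed. We will propose a new and fast initial estimate in Section 3.1. The first two derivatives of $\hat Q_n(\beta_n)$ are given by

$$ \nabla \hat Q_n(\beta_n) = \sum_{i=1}^n q_{1i}(\beta_n)\,(Z_{ni} + \hat\alpha'_{\beta_n}(U_i) X_i), $$

$$ \nabla^2 \hat Q_n(\beta_n) = \sum_{i=1}^n q_{2i}(\beta_n)\,(Z_{ni} + \hat\alpha'_{\beta_n}(U_i) X_i)(Z_{ni} + \hat\alpha'_{\beta_n}(U_i) X_i)^T + \sum_{i=1}^n q_{1i}(\beta_n) \sum_{r=1}^q \frac{\partial^2 \hat\alpha^{(r)}_{\beta_n}(U_i)}{\partial\beta_n \partial\beta_n^T} X_{ir}, \tag{3.2} $$

where $q_l(x, y) = \frac{\partial^l}{\partial x^l} Q(g^{-1}(x), y)$, $q_{ki}(\beta_n) = q_k(\hat m_{ni}(\beta_n), Y_{ni})$ $(k = 1, 2)$ with $\hat m_{ni}(\beta_n) = \hat\alpha_{\beta_n}(U_i)^T X_i + Z_{ni}^T \beta_n$. In the above formulae, $\hat\alpha'_{\beta_n}(u) = \partial\hat\alpha_{\beta_n}(u)/\partial\beta_n$ is a $p_n$ by $q$ matrix, $\hat\alpha^{(r)}_{\beta_n}(u)$ is the $r$th component of $\hat\alpha_{\beta_n}(u)$, and $X_{ir}$ is the $r$th component of $X_i$.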

As the first two derivatives of $\hat\alpha_{\beta_n}(u)$ are hard to compute in (3.2), one can employ the backfitting algorithm, which iterates between (2.4) and (2.3). This is really the same as the fully iterated algorithm (3.1) but ignores the functional dependence of $\hat\alpha_{\beta_n}(u)$ in (2.5) on $\beta_n$; it uses the value of $\beta_n$ from the previous step of the iteration as a proxy. More precisely, the backfitting algorithm treats the terms $\hat\alpha'_{\beta_n}(u)$ and $\hat\alpha''_{\beta_n}(u)$ in (3.2) as zero and computes $\hat m_{ni}(\beta_n)$ using the value of $\beta_n$ from the previous iteration. The maximization is thus much easier to carry out, but the convergence speed can be reduced. See [13] and [19] for more descriptions of the two methods and some closed-form solutions proposed for the partially linear models.

Between these two extreme choices is our modified algorithm, which ignores the computation of the second derivative of $\hat\alpha_{\beta_n}(u)$ in (3.1) but keeps its first derivative in the iteration. Namely, the second term in (3.2) is treated as zero. Details will be given in Section 3.2. It turns out that this algorithm significantly improves the computation without sacrificing accuracy. At the same time, it substantially enhances the stability of the algorithm. We will call it the accelerated profile-kernel algorithm.

When the quasi-likelihood becomes a square loss, the accelerated profile-kernel algorithm is exactly the same as that used to compute the full profile likelihood estimate, since $\hat\alpha_{\beta_n}(\cdot)$ is linear in $\beta_n$.
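
The square-loss remark can be made concrete with a small sketch for the simplest special case, the partially linear model $Y = \alpha(U) + Z^T\beta + \varepsilon$ (q = 1, X = 1). It assumes a local constant (Nadaraya-Watson) smoother rather than a general local polynomial fit; the names `nw_smoother_matrix` and `profile_least_squares` are illustrative, not the paper's.

```python
import numpy as np

def nw_smoother_matrix(U, h):
    """Row-normalized Epanechnikov kernel weights (local constant smoother)."""
    T = (U[:, None] - U[None, :]) / h
    K = 0.75 * np.clip(1.0 - T ** 2, 0.0, None)
    return K / K.sum(axis=1, keepdims=True)

def profile_least_squares(U, Z, Y, h):
    """Profile estimate of beta under square loss (assumed special case).

    For fixed beta the fitted curve is S (Y - Z beta), which is linear in
    beta, so profiling reduces to least squares on the smoothed-out data
    (I - S) Z and (I - S) Y.
    """
    S = nw_smoother_matrix(U, h)
    Z_tilde = Z - S @ Z
    Y_tilde = Y - S @ Y
    beta, *_ = np.linalg.lstsq(Z_tilde, Y_tilde, rcond=None)
    alpha_hat = S @ (Y - Z @ beta)       # curve estimate at the data points
    return beta, alpha_hat
```

Because the fitted curve is linear in beta here, no iteration is needed, which is why the accelerated and full profile procedures coincide in this case.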

3.1 Difference-based estimation 

We generalize the difference-based idea to obtain an initial estimate $\beta_n^{(0)}$. The idea has been used in [32] and [7] to remove the nonparametric component in the partially linear model.

We first consider the specific case of the GVCPLM:

$$ Y = \alpha(U)^T X + \beta_n^T Z_n + \varepsilon. \tag{3.3} $$

This is the varying-coefficient partially linear model studied by [33] and [31]. Let the random sample $\{(U_i, X_i^T, Z_{ni}^T, Y_i)\}_{i=1}^n$ be from the model (3.3), with the data ordered according to the $U_i$'s. Under mild conditions, the spacing $U_{i+j} - U_i$ is $O_P(1/n)$, so that

$$ \alpha(U_{i+j}) - \alpha(U_i) \approx \gamma_0 + \gamma_1 (U_{i+j} - U_i), \quad j = 1, \ldots, q. \tag{3.4} $$

Indeed, it can be approximately zero; the linear term is used to reduce the approximation errors.

For given weights $w_j$ (their dependence on $i$ is suppressed for simplicity), define

$$ Y_i^* = \sum_{j=1}^{q+1} w_j Y_{i+j-1}, \quad Z_{ni}^* = \sum_{j=1}^{q+1} w_j Z_{n(i+j-1)}, \quad \varepsilon_i^* = \sum_{j=1}^{q+1} w_j \varepsilon_{i+j-1}. $$

If we choose the weights to satisfy $\sum_{j=1}^{q+1} w_j X_{i+j-1} = 0$, then using (3.3) and (3.4), we have

$$ Y_i^* \approx \gamma_0^T \sum_{j=1}^{q+1} w_j X_{i+j-1} + \gamma_1^T \sum_{j=1}^{q+1} w_j U_{i+j-1} X_{i+j-1} + \beta_n^T Z_{ni}^* + \varepsilon_i^*. $$

Ignoring the approximation, which is of order $O_P(n^{-1})$, the above is a multiple regression model with parameters $(\gamma_0, \gamma_1, \beta_n)$. The parameters can be found by a weighted least squares fit to the $(n - q)$ starred data. This yields a root-$n$ consistent estimate of $\beta_n$, as the above approximation for finite $q$ is of order $O_P(n^{-1})$.

To solve $\sum_{j=1}^{q+1} w_j X_{i+j-1} = 0$, we need to find the rank of the matrix $(X_i, \ldots, X_{i+q})$, denoted by $r$. Fix $q + 1 - r$ of the $w_j$'s; the rest can then be determined uniquely by solving the system of linear equations for $\{w_j, j = 1, \ldots, q+1\}$. For random designs, with probability 1, $r = q$. Hence, the direction of the weights $\{w_j, j = 1, \ldots, q+1\}$ is uniquely determined. For example, in the partially linear model, $q = 1$ and $X_i = 1$. Hence, $(w_1, w_2) = c(1, -1)$ and the constant $c$ can be chosen so that the weight vector has norm one. This results in the difference-based estimator in [32] and [7].
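
The partially linear special case (q = 1, X = 1, weights proportional to (1, -1)) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses ordinary rather than weighted least squares and drops the linear correction term in the spacings, and the function name `difference_based_beta` is hypothetical.

```python
import numpy as np

def difference_based_beta(U, Z, Y):
    """First-order difference estimate of beta in Y = alpha(U) + Z beta + eps.

    Data are sorted by U, adjacent rows are differenced with weights
    (1, -1) / sqrt(2), and beta is fitted by ordinary least squares on
    the differenced data (a simplification of the weighted fit).
    """
    order = np.argsort(U)
    Zs, Ys = Z[order], Y[order]
    w = np.array([1.0, -1.0]) / np.sqrt(2.0)   # norm-one weights
    Z_star = w[0] * Zs[:-1] + w[1] * Zs[1:]
    Y_star = w[0] * Ys[:-1] + w[1] * Ys[1:]
    beta, *_ = np.linalg.lstsq(Z_star, Y_star, rcond=None)
    return beta
```

Differencing adjacent observations removes the smooth curve up to a term of the order of the spacing, so no bandwidth is needed for this initial estimate.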

To use the differencing idea to obtain an initial estimate of $\beta_n$ for the GVCPLM, we apply a transformation to the data. If $g$ is the link function, we use $g(Y_i)$ as the transformed data and proceed with the difference-based method as for the VCPLM. Note that for some models, such as logistic regression with the logit link and the Poisson log-linear model, we need to make adjustments in transforming the data. We use $g(y) = \log\big(\frac{y + \delta}{1 - y + \delta}\big)$ for logistic regression and $g(y) = \log(y + \delta)$ for Poisson regression. Here, the parameter $\delta$ is treated as a smoothing parameter like $h$, and its choice will be discussed in Section 3.4.
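
A small sketch of the adjusted transformations follows. It assumes the adjusted logit has the form log((y + delta) / (1 - y + delta)), which keeps the transform finite at y = 0 and y = 1; the function names are illustrative.

```python
import numpy as np

def transform_bernoulli(y, delta):
    """Adjusted logit, assumed form log((y + delta) / (1 - y + delta))."""
    y = np.asarray(y, dtype=float)
    return np.log((y + delta) / (1.0 - y + delta))

def transform_poisson(y, delta):
    """Adjusted log transform log(y + delta), finite at y = 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y + delta)
```

The transformed responses would then be passed to the difference-based fit in place of the raw responses, with delta tuned alongside the bandwidth.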

3.2 Accelerated profile-kernel algorithm 

As mentioned before, the accelerated profile-kernel algorithm needs to compute $\hat\alpha'_{\beta_n}(u)$, which will be replaced by the consistent estimate given in the following theorem. The proof is in Section 5.

Theorem 5 Under Regularity Conditions (A)-(G), provided $\sqrt{p_n}\,(h + c_n \log^{1/2}(1/h)) = o(1)$ where $c_n = (nh)^{-1/2}$, we have for each $\beta_n \in \Omega_n$,

$$ \hat\alpha'_{\beta_n}(u) = -\Big\{\sum_{i=1}^n q_{2i}(\beta_n) Z_{ni} X_i^T K_h(U_i - u)\Big\} \Big\{\sum_{i=1}^n q_{2i}(\beta_n) X_i X_i^T K_h(U_i - u)\Big\}^{-1} $$

being a consistent estimator of $\alpha'_{\beta_n}(u)$ which holds uniformly in $u \in \Omega$.
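
The estimator in Theorem 5 is a ratio of two kernel-weighted moment matrices, which can be sketched as below. The sketch assumes the values $q_{2i}(\beta_n)$ have already been computed and are passed in as an array `q2`, and it uses the Epanechnikov kernel; the function name `alpha_prime_hat` is hypothetical.

```python
import numpy as np

def alpha_prime_hat(u, U, X, Z, q2, h):
    """Sketch of -{sum q2 Z X^T K_h}{sum q2 X X^T K_h}^{-1} at the point u.

    U : (n,) index variable, X : (n, q), Z : (n, p),
    q2 : (n,) second-derivative values q_{2i}(beta) (assumed given).
    Returns a (p, q) matrix.
    """
    k = 0.75 * np.clip(1.0 - ((U - u) / h) ** 2, 0.0, None) / h   # K_h(U_i - u)
    XW = X * (q2 * k)[:, None]
    A = Z.T @ XW                      # (p, q): sum of q2 * Z X^T * K_h
    B = X.T @ XW                      # (q, q): sum of q2 * X X^T * K_h
    return -np.linalg.solve(B.T, A.T).T   # equals -A @ inv(B)
```

In the square-loss case q2 is constant, and the expression agrees with the derivative of the kernel-weighted least-squares curve with respect to beta.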
Since the function $q_2(\cdot, \cdot) < 0$ by Regularity Condition (D), by ignoring the second term in (3.2), the modified $\nabla^2 \hat Q_n(\beta_n)$ in equation (3.2) is still negative-definite. This ensures that the Newton-Raphson update of the profile-kernel procedure can be carried out smoothly. The intuition behind the modification is that, in a neighborhood of the true parameter $\beta_{n0}$, the least favorable curve $\alpha_{\beta_n}(u)$ should be approximately linear in $\beta_n$.

3.3 One-step estimation for the nonparametric component

Given $\beta_n = \beta_n^{(k)}$, we need to compute $\hat\alpha_{\beta_n}(u)$ in order to compute $\hat m_{ni}(\beta_n)$ and hence the modified gradient vector and Hessian matrix in (3.1). This is the same as estimating the varying coefficient functions under model (1.1) with known $\beta_n$. [I] propose a one-step local MLE, which is shown to be as efficient as the fully iterated one. They also propose an efficient algorithm to compute these varying coefficient functions. Their algorithm can be directly adapted here. Details can be found in [I].

3.4 Choice of bandwidth

As mentioned at the end of Section 3.1, in addition to choosing the bandwidth $h$, we have an extra smoothing parameter $\delta$ to be determined due to the adjustments to the transformation of the response $Y_{ni}$. The two-dimensional smoothing parameter $(\delta, h)$ can be selected by $K$-fold cross-validation, using the quasi-likelihood as a criterion function. As demonstrated in Section 4, practical accuracy can be achieved in several iterations using the accelerated profile-kernel algorithm. Hence, the profile-kernel estimate can be computed rapidly. As a result, the $K$-fold cross-validation is not too computationally intensive, as long as $K$ is not too large (e.g. $K = 5$ or 10).

4 Numerical properties

To evaluate the performance of the estimator $\hat\alpha(\cdot)$, we use the square-root of average errors (RASE)

$$ \mathrm{RASE} = \Big\{ n_{\mathrm{grid}}^{-1} \sum_{k=1}^{n_{\mathrm{grid}}} \|\hat\alpha(u_k) - \alpha(u_k)\|^2 \Big\}^{1/2} $$

over $n_{\mathrm{grid}} = 200$ grid points $\{u_k\}$. The performance of the estimator $\hat\beta_n$ is assessed by the generalized mean square error (GMSE)

$$ \mathrm{GMSE} = (\hat\beta_n - \beta_{n0})^T B (\hat\beta_n - \beta_{n0}), $$

where $B = E Z_n Z_n^T$.
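
The two criteria are simple to compute once the estimates are in hand. The sketch below assumes the estimated and true coefficient functions have been evaluated on a common grid and stored as arrays of shape (n_grid, q), with RASE taken as the root of the average squared Euclidean error over the grid; the function names `rase` and `gmse` are illustrative.

```python
import numpy as np

def rase(alpha_hat, alpha_true):
    """Root of the average squared error over grid points.

    alpha_hat, alpha_true : arrays of shape (n_grid, q).
    """
    diff = np.asarray(alpha_hat, dtype=float) - np.asarray(alpha_true, dtype=float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def gmse(beta_hat, beta_true, B):
    """Generalized mean square error (b_hat - b0)^T B (b_hat - b0)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta_true, dtype=float)
    return float(d @ np.asarray(B, dtype=float) @ d)
```

In a simulation, B would be the known covariance of Z_n implied by the design, so that GMSE weights errors in correlated directions appropriately.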

Throughout our simulation studies, the dimensionality of the parametric component is taken as $p_n = \lfloor 1.8 n^{1/3} \rfloor$ and that of the nonparametric component as $q = 2$, in which $X_1 \equiv 1$ and $X_2 \sim N(0, 1)$. The rate $p_n = O(n^{1/3})$ is not the same as that presented in the theorems in Section 2, but we use it to show the capability of the accelerated profile-kernel method to handle a higher rate of parameter growth. In addition, the covariate vector $(Z_n^T, X_2)^T$ is a $(p_n + 1)$-dimensional normal random vector with mean zero and covariance matrix $(\sigma_{ij})$, where $\sigma_{ij} = 0.5^{|i-j|}$. Furthermore, we always take $U \sim U(0, 1)$ independent of the other covariates. Finally, we use $\mathrm{SD}_{\mathrm{mad}}$ to denote the robust estimate of standard deviation, which is defined as the interquartile range divided by 1.349. The number of simulations is 400, except in Table 2 (where it is 50) due to the intensive computation of the fully iterated profile-kernel estimate.
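
A minimal sketch of this simulation design is given below, assuming the bracket in the dimension rule denotes the integer part, which reproduces the values 10, 13, 16 and 20 used in the tables. All function names are illustrative.

```python
import numpy as np

def p_n(n):
    """Parametric dimension: integer part of 1.8 * n^(1/3)."""
    return int(np.floor(1.8 * n ** (1.0 / 3.0)))

def ar_covariance(dim, rho=0.5):
    """Covariance matrix with entries rho^|i - j|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def draw_covariates(n, rng):
    """Draw (U, X, Z): (Z^T, X_2)^T jointly normal, X_1 = 1, U ~ U(0, 1)."""
    p = p_n(n)
    V = rng.multivariate_normal(np.zeros(p + 1), ar_covariance(p + 1), size=n)
    Z = V[:, :p]
    X = np.column_stack([np.ones(n), V[:, p]])
    U = rng.uniform(0.0, 1.0, size=n)
    return U, X, Z

def sd_mad(x):
    """Robust standard deviation: interquartile range divided by 1.349."""
    q75, q25 = np.percentile(x, [75, 25])
    return (q75 - q25) / 1.349
```

For example, `draw_covariates(200, np.random.default_rng(0))` returns a 200 by 2 matrix X and a 200 by 10 matrix Z.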

Poisson model. The response $Y$, given $(U, X, Z_n)$, has a Poisson distribution with mean function $\mu(U, X, Z_n)$, where

$$ \log(\mu(U, X, Z_n)) = X^T \alpha(U) + Z_n^T \beta_n. $$

We have $\beta_{n0} = (0.5, 0.3, -0.5, 1, 0.1, -0.25, 0, \ldots, 0)^T$, the $p_n$-dimensional parameter vector. The coefficient functions are given by

$$ \alpha_1(u) = 4 + \sin(2\pi u), \quad \text{and} \quad \alpha_2(u) = 2u(1 - u). $$

Bernoulli model. The response $Y$, given $(U, X, Z_n)$, has a Bernoulli distribution with the success probability given by

$$ p(U, X, Z_n) = \exp\{X^T \alpha(U) + Z_n^T \beta_n\} / [1 + \exp\{X^T \alpha(U) + Z_n^T \beta_n\}]. $$

The $p_n$-dimensional parameter vector is $\beta_{n0} = (3, 1, -2, 0.5, 2, -2, 0, \ldots, 0)^T$ and the varying coefficient functions are given by

$$ \alpha_1(u) = 2(u^3 + 2u^2 - 2u), \quad \text{and} \quad \alpha_2(u) = 2\cos(2\pi u). $$
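
The two data-generating models can be sketched as follows, assuming covariates U, X (with a leading column of ones) and Z with at least six columns are supplied by the caller. The names `alpha_poisson`, `alpha_bernoulli`, `beta_n0` and `simulate_response` are illustrative.

```python
import numpy as np

def alpha_poisson(u):
    """Coefficient functions (4 + sin(2 pi u), 2 u (1 - u))."""
    u = np.asarray(u, dtype=float)
    return np.stack([4.0 + np.sin(2.0 * np.pi * u), 2.0 * u * (1.0 - u)], axis=-1)

def alpha_bernoulli(u):
    """Coefficient functions (2 (u^3 + 2 u^2 - 2 u), 2 cos(2 pi u))."""
    u = np.asarray(u, dtype=float)
    return np.stack([2.0 * (u ** 3 + 2.0 * u ** 2 - 2.0 * u),
                     2.0 * np.cos(2.0 * np.pi * u)], axis=-1)

BETA_LEAD = {"poisson": [0.5, 0.3, -0.5, 1.0, 0.1, -0.25],
             "bernoulli": [3.0, 1.0, -2.0, 0.5, 2.0, -2.0]}

def beta_n0(model, p):
    """True parameter vector: six leading values, zeros elsewhere (p >= 6)."""
    beta = np.zeros(p)
    beta[:6] = BETA_LEAD[model]
    return beta

def simulate_response(model, U, X, Z, rng):
    """Draw responses from the Poisson (log link) or Bernoulli (logit) model."""
    alpha = alpha_poisson(U) if model == "poisson" else alpha_bernoulli(U)
    eta = np.sum(alpha * X, axis=1) + Z @ beta_n0(model, Z.shape[1])
    if model == "poisson":
        return rng.poisson(np.exp(eta))
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
```

Note that the Poisson means are fairly large here because the first coefficient function is centered at 4.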

Throughout our numerical studies, we use the Epanechnikov kernel $K(u) = 0.75(1 - u^2)_+$ and 5-fold cross-validation to choose the bandwidth $h$ and $\delta$. With the assistance of the 5-fold cross-validation, we chose $\delta = 0.1$ and $h = 0.1$, 0.08, 0.075 and 0.06, respectively, for $n = 200$, 400, 800 and 1500 for the Poisson model. For the Bernoulli model, $\delta = 0.005$ and $h = 0.45$, 0.4, 0.25 and 0.18 were chosen, respectively, for $n = 200$, 400, 800 and 1500.
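
The kernel and the fold construction used for such a cross-validation can be sketched as follows. The sketch only builds the ingredients; the criterion itself (the quasi-likelihood evaluated on held-out folds) is model specific and is not shown. The names `kernel_h` and `kfold_indices` are illustrative.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2)_+ ."""
    u = np.asarray(u, dtype=float)
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def kernel_h(t, h):
    """Rescaled kernel K_h(t) = K(t / h) / h."""
    return epanechnikov(np.asarray(t, dtype=float) / h) / h

def kfold_indices(n, k, rng):
    """Randomly split the indices 0, ..., n - 1 into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)
```

A grid of candidate pairs (delta, h) would then be scored by refitting on k - 1 folds and evaluating the criterion on the remaining fold.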

Note that $X_2$ and the $Z_{ni}$'s are not bounded random variables, as required in Condition (A) in Section 5. However, they still satisfy the moment conditions needed in the proofs, and Condition (A) is imposed merely to simplify these proofs. Condition (B) is satisfied mainly because the correlations between $Z_{ni}$'s that are further apart are weak, and Condition (C) is satisfied because it involves products of standard normal random variables, which are bounded in the first two moments.

4.1 Comparisons of algorithms 

Table 2: Computation time and accuracy for different computing algorithms

   n   p_n   backfitting   accelerated profile-kernel   full profile-kernel

Median and SD_mad (in parentheses) of computing times in seconds
 200    10   .6(.0)        .7(.0)                       77.2(.2)
 400    13   .8(.0)        1.4(.0)                      463.2(.9)

Median and SD_mad (in parentheses) of GMSE (multiplied by 10^4)
 200    10   10.72(6.47)   5.45(2.71)                   9.74(14.67)
 400    13   5.63(4.39)    2.78(1.19)                   5.26(9.46)

Median RASE relative to the oracle estimate
 200    10   .848          .970                         .895
 400    13   .856          .986                         .882

We first compare the computing times and the accuracies of three algorithms: the 3-step backfitting, 3-step accelerated profile-kernel and fully iterated profile-kernel algorithms. All of them use the difference-based estimate as the initial estimate. Table 2 summarizes the results based on the Poisson model with 50 samples.

With the same initial values, the backfitting algorithm is slightly faster than the accelerated profile-kernel algorithm, which in turn is far faster than the full profile-kernel algorithm. Our experience shows that the backfitting algorithm needs more than 20 iterations to converge, without improving the GMSE much. In terms of the accuracy of estimating the parametric component, the accelerated profile-kernel algorithm is about twice as accurate as the backfitting algorithm and the full profile-kernel one. This demonstrates the advantage of keeping the curvature of the least-favorable function in the Newton-Raphson algorithm. For the nonparametric component, we compare the RASEs of the three algorithms with those based on the oracle estimator, which uses the true value of $\beta_n$. The ratios of the RASEs based on the oracle estimator to those based on the three algorithms are reported in Table 2. It is clear that the accelerated profile-kernel estimate performs very well in estimating the nonparametric components, closely mimicking the oracle estimator. The second best is the backfitting algorithm.

We have also compared the three algorithms using the Bernoulli model. Our proposed accelerated profile-kernel estimate still performs the best in terms of accuracy, though the improvement is not as dramatic as for the Poisson model. We speculate that the poor performance of the full profile-kernel estimate is due to its unstable implementation, which is related to computing the second derivatives of the least-favorable curve.

Table 3: Medians of the percentages of GMSE based on the accelerated profile-kernel estimates

               Poisson               Bernoulli
    n   p_n    AF/DBE    AF/3S       AF/DBE    AF/3S
  200    10       8.2     99.9         64.1    101.7
  400    13       6.0    100.2         52.7    104.7
  800    16       5.0    100.1         50.9    102.6
 1500    20       4.2    100.0         46.4    100.5

We next demonstrate the accuracy of the three-step accelerated profile-kernel estimate (3S), compared with the fully iterated accelerated profile-kernel estimate (AF) (iterating until convergence) and the difference-based estimate (DBE), which is our initial estimate. Table 3 reports the ratios of GMSE based on 400 simulations. It demonstrates that, with the DBE as the initial estimate, three iterations achieve accuracy comparable to that of the fully iterated algorithm. In fact, the one-step accelerated profile-kernel estimates improve dramatically on our initial estimate (DBE) (not shown here). On the other hand, the DBE itself is not accurate enough for GVCPLM.

The effect of bandwidth choice on the estimation of the parametric component is summarized in Table 4. Denote by $h_{CV}$ the bandwidth chosen by cross-validation. We scaled the bandwidth up and down by a factor of 1.5. For illustration, we use the one-step accelerated profile-kernel estimate. The results for the three-step profile-kernel estimate are similar. We evaluate the performance for all components using GMSE and for one specific component of $\beta_n$ using MSE (the results for other components are similar). We do not report all the results here, to save space. It is clear that the GMSE does not depend sensitively on the bandwidth, as long as it is reasonably close to $h_{CV}$. This is consistent with our asymptotic results.

Table 4: One-step estimate of parametric components with different bandwidths

              Poisson                                                  Bernoulli
              Median and SD_mad of     Mean and SD of MSE x 10^4       Median and SD_mad of
              GMSE x 10^5              for one component of beta_n     GMSE x 10
    n   p_n   h_CV       1.5 h_CV      0.66 h_CV    h_CV               0.66 h_CV   h_CV
  200    10   5.9(3.0)   6.4(3.3)      993(112)     995(105)           8.2(4.4)    8.4(5.1)
  400    13   3.1(1.4)   3.0(1.4)      1004(67)     1001(65)           4.8(2.2)    5.4(2.5)
  800    16   1.7(0.7)   1.7(0.6)      999(47)      999(46)            2.7(1.0)    2.7(1.1)
 1500    20   1.1(0.3)   1.1(0.4)      1000(32)     1000(32)           1.8(0.7)    1.8(0.6)

SD and SD_mad are shown in parentheses.

4.2 Accuracy of profile-likelihood inferences 



Table 5: Standard deviations and estimated standard errors

              Poisson, values x 1000                 Bernoulli, values x 10
    n   p_n   SD    SD_m        SD    SD_m           SD    SD_m       SD    SD_m
  200    10   9.1   8.5(1.3)    9.9   9.4(1.3)       3.6   2.9(.4)    3.2   2.8(.4)
  400    13   6.0   5.6(0.7)    6.5   6.1(0.7)       2.3   2.1(.2)    2.2   2.0(.2)
  800    16   3.7   3.8(0.3)    4.1   4.2(0.4)       1.7   1.6(.1)    1.5   1.5(.1)
 1500    20   2.8   2.7(0.2)    3.1   3.0(0.2)       1.2   1.2(.1)    1.1   1.1(.1)

SD_mad are shown in parentheses.

To test the accuracy of the sandwich formula for estimating standard errors, the 
standard deviations of the estimated coefficients (using the one-step accelerated 
profile-kernel estimate) are computed from the 400 simulations using $h_{CV}$. These can be regarded 
as the true standard errors (columns labeled SD). The 400 estimated standard errors 
are summarized by their median (columns SD_m) and its associated SD_mad. Table 5 
summarizes the results. Clearly, the sandwich formula does a good job, and accuracy 
improves as n increases. 
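
The comparison behind the SD and SD_m columns can be sketched in a few lines. The snippet is only an illustration: the function name `summarize_se` is hypothetical, the inputs are made-up numbers rather than simulation output from the paper, and the robust spread is assumed to be the median absolute deviation divided by 0.6745, a common convention that this section does not spell out.

```python
import statistics

def summarize_se(estimates, estimated_ses):
    """Compare the empirical SD of estimates with the median estimated SE.

    estimates     -- one coefficient estimate per simulation run
    estimated_ses -- the standard error reported by the sandwich formula in each run
    Returns (SD, SD_m, SD_mad); SD_mad = MAD / 0.6745 is an assumed convention.
    """
    sd = statistics.stdev(estimates)           # plays the role of the "true" SE
    sd_m = statistics.median(estimated_ses)    # median of the estimated SEs
    mad = statistics.median([abs(s - sd_m) for s in estimated_ses])
    return sd, sd_m, mad / 0.6745

# Toy inputs (not data from the paper).
sd, sd_m, sd_mad = summarize_se([1.0, 1.2, 0.8, 1.1, 0.9],
                                [0.15, 0.16, 0.14, 0.17, 0.15])
```

A sandwich formula that works well should give SD_m close to SD, with SD_mad shrinking as n grows.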

We now study the performance of the GLRT in Section 2.2. To this end, we consider the 
following null hypothesis: 

$$H_0: \beta_7 = \beta_8 = \cdots = \beta_{p_n} = 0.$$

We examine the power of the test under a sequence of alternative hypotheses indexed 
by a parameter $\gamma$ as follows: 

$$H_1: \beta_7 = \beta_8 = \gamma, \quad \beta_j = 0 \ \text{for } j > 8.$$

When $\gamma = 0$, the alternative hypothesis becomes the null hypothesis. 
[Figure 1, four panels: (a) Null density estimation (n=400, h=h_CV); (b) Power function (n=400, h=h_CV); (c) Null density estimation (n=800, h=0.66h_CV); (d) Power function (n=200, h=h_CV).]

Figure 1: (a) Asymptotic null distribution (solid) and estimated true null distribution 
(dotted) for the Poisson model. (b) The power function at significance levels $\alpha$ = 0.01, 0.05 
and 0.1. The captions for (c) and (d) are the same as those in (a) and (b) except that 
the Bernoulli model is now used. 



Under the null hypothesis, the GLRT statistics are computed for each of 400 
simulations, using the one-step accelerated profile-kernel estimates. Their distribution is 
summarized by a kernel density estimate and can be regarded as the true null 
distribution. This is compared with the asymptotic null distribution $\chi^2_{p_n-6}$. Figures 1(a) and (c) 
show the results when n = 400. The finite sample null density is seen to be reasonably 
close to the asymptotic one, except for the Monte Carlo error. 

The power of the GLR test is studied under a sequence of alternative models that 
deviate progressively from the null hypothesis as $\gamma$ increases. Again, the one-step 
accelerated profile-kernel algorithm is employed. The power functions are calculated at 
three significance levels: 0.1, 0.05 and 0.01, using the asymptotic distribution. They are 
the proportions of rejections among the 400 simulations and are depicted in Figures 1(b) 
and (d). The power curves increase rapidly with $\gamma$, which shows that the GLR test is 
powerful. The powers at $\gamma = 0$ are approximately the same as the significance levels, except for 
Monte Carlo error. This shows that the size of the test is reasonably accurate. 
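
The empirical power described above is simply a rejection proportion against a chi-square critical value. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the statistics are taken to be already on the chi-square scale, and the critical value uses the Wilson-Hilferty approximation (chosen only to keep the example dependency-free, not because the paper uses it).

```python
from statistics import NormalDist

def chi2_quantile_wh(prob, df):
    """Approximate chi-square quantile via the Wilson-Hilferty cube-root formula."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def rejection_rate(glr_stats, df, alpha):
    """Proportion of simulated statistics exceeding the level-alpha critical value."""
    crit = chi2_quantile_wh(1.0 - alpha, df)
    return sum(t > crit for t in glr_stats) / len(glr_stats)

# Toy statistics (not simulation output from the paper); df = p_n - 6 with p_n = 13.
rate = rejection_rate([1.0, 5.0, 15.0, 20.0], df=7, alpha=0.05)
```

Evaluating `rejection_rate` on statistics simulated under each value of the alternative parameter traces out a power curve; at the null value it estimates the size of the test.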
4.3 A real data example 

We now analyze the data introduced in Section 1.1, where details of the data and variables are 
given. 

To examine the nonlinear effect of age and its nonlinear interaction with the expe- 
rience, we appeal to the following GVCPLM (interactions between age and covariates 
other than TotalYrsExp are considered but found to be insignificant): 



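$$\log\left(\frac{p_H}{1 - p_H}\right) = \alpha_1(\mathrm{Age}) + \alpha_2(\mathrm{Age})\,\mathrm{TotalYrsExp} + \beta_1\,\mathrm{Female} + \beta_2\,\mathrm{PCJob} + \sum_{i=1}^{4} \beta_{2+i}\,\mathrm{Edu}_i, \tag{4.1}$$

where $p_H$ is the probability of having a high grade job. Formally, we are testing 

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 < 0. \tag{4.2}$$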

Table 6: Fitted coefficients (sandwich SD) for model (4.1) 

 Response      Female       PCJob        Edu1         Edu2         Edu3         Edu4
 HighGrade4    -1.96(.57)   -0.02(.76)   -5.14(.85)   -4.77(.98)   -2.72(.52)   -2.85(.96)
 HighGrade5    -2.22(.59)   -1.96(.61)   -5.69(.67)   -5.95(.97)   -3.09(.72)   -1.26(1.10)


A 20-fold CV is employed to select the bandwidth $h$ and the parameter $\delta$ in the 
transformation of the data. This yields $h_{CV} = 24.2$ and $\delta_{CV} = 0.1$. Table 6 shows the 
results of the fit using the three-step accelerated profile-kernel estimate. The coefficient 
for Female is significantly negative. Education also plays an important role in 
getting a high grade job. All education coefficients are negative, as they are contrasted with the 
highest education level. PCJob does not seem to play any significant role in getting 
promoted. Figures 2(a) and (b) depict the estimated coefficient functions. They show 
that as age increases one has a better chance of being in a higher job grade, and that 
the marginal effect of working experience is large when age is around 30 or less, but 
starts to fall as one gets older. However, the second result should be interpreted with 
caution, as the variables Age and TotalYrsExp are highly correlated (Figure 2(c)). 



The standardized residuals $(y - \hat p_H)/\sqrt{\hat p_H(1 - \hat p_H)}$ are plotted against Age in Figure 2(d). 
The plot suggests that the fit is reasonable. Other diagnostic plots also look reasonable, but 
they are not shown here. 
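
The standardized residual used in this diagnostic is straightforward to compute once fitted probabilities are available. The sketch below is illustrative only: the function name is hypothetical and the inputs are toy values, not fitted probabilities from the bank data.

```python
import math

def standardized_residuals(y, p_hat):
    """Return (y - p) / sqrt(p (1 - p)) for binary responses y and fitted probabilities p."""
    return [(yi - pi) / math.sqrt(pi * (1.0 - pi)) for yi, pi in zip(y, p_hat)]

# Toy values (not from the paper's data set).
res = standardized_residuals([1, 0], [0.5, 0.2])
```

Plotting such residuals against a covariate, as in panel (d), helps reveal systematic lack of fit.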
[Figure 2, four panels: (a) Fitted coefficient function $\alpha_1(\cdot)$; (b) Fitted coefficient function $\alpha_2(\cdot)$; (c) TotalYrsExp against Age; (d) Standardized residuals against Age.]

Figure 2: (a) Fitted coefficient function $\alpha_1(\cdot)$. (b) Fitted coefficient function $\alpha_2(\cdot)$. (c) The 
scatter plot of 'TotalYrsExp' against 'Age'. (d) Standardized residuals against the variable 
'Age'. 

We have conducted another fit using a binary variable HighGrade5, which is 0 only 
when job grade is less than 5. The coefficients are shown in Table 6, and the Female 
coefficient is close to that of the first fit. 

We now apply the generalized likelihood ratio test to the problem (4.2). The GLR 
test statistic is 14.47 with one degree of freedom, resulting in a P-value of 0.0001. We have 
also conducted the same analysis using HighGrade5 as the binary response. The GLR 
test statistic is now 13.76 and the associated P-value is 0.0002. The fitted coefficients are 
summarized in Table 6. The result provides stark evidence that, even after adjusting for 
other confounding factors and variables, it is harder for female employees of the Fifth National Bank 
to be promoted to a high grade job. 
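
The reported P-values can be checked against the chi-square distribution with one degree of freedom, whose upper tail has a closed form in terms of the complementary error function. This is a small sanity-check sketch; the function name is hypothetical.

```python
import math

def chi2_1df_pvalue(stat):
    """Upper-tail probability P(X > stat) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

p_grade4 = chi2_1df_pvalue(14.47)  # statistic reported for HighGrade4
p_grade5 = chi2_1df_pvalue(13.76)  # statistic reported for HighGrade5
```

Both values agree with the P-values quoted above once rounded to four decimal places.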

Although the results are not shown in this paper, we have repeated the analysis after deleting 6 data 
points corresponding to 5 male executives and 1 female employee having many years of 
working experience and high salaries. The test results remain similar. 
5 Technical proofs 



In this section the proofs of Theorems 1-4 will be given. We introduce some notation 
and regularity conditions for our results. In the following and thereafter, the symbol $\otimes$ 
represents the Kronecker product between matrices, and $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote 
respectively the minimum and maximum eigenvalues of a symmetric matrix $A$. We let 
$Q_{ni}(\beta_n)$ be the $i$-th summand of (2.3). 

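Denote the true linear parameter by $\beta_{n0}$, with parameter space $\Omega_n \subset \mathbb{R}^{p_n}$. Let 
$\mu_k = \int_{-\infty}^{\infty} u^k K(u)\,du$ and $A_p(X) = (\mu_{i+j})_{0 \le i,j \le p} \otimes XX^T$. Set 

$$\rho_l(t) = \frac{(dg^{-1}(t)/dt)^l}{V(g^{-1}(t))}, \qquad m_{ni}(\beta_n) = \alpha_{\beta_n}(U_i)^T X_i + \beta_n^T Z_{ni},$$

$$\alpha'_{\beta_n}(u) = \frac{\partial \alpha_{\beta_n}(u)}{\partial \beta_n}, \qquad \alpha''^{(r)}_{\beta_n}(u) = \frac{\partial^2 \alpha^{(r)}_{\beta_n}(u)}{\partial \beta_n\, \partial \beta_n^T}.$$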

Regularity Conditions: 

(A) The covariates Z n and X are bounded random variables. 

(B) The smallest and the largest eigenvalues of the matrix $I_n(\beta_{n0})$ are bounded away 
from zero and infinity for all $n$. In addition, $E[\nabla^T Q_{n1}(\beta_{n0}) \nabla Q_{n1}(\beta_{n0})]^4 = O(p_n^4)$. 

(C) Ep \ J l Z Qnl{1 ^ landEa I J^ 1 ^ ? are bounded for all n, with I = 1, ■ ■ ■ , 4 

nki ' "QPrikj 

and j = 0, 1. 

(D) The function $q_2(x,y) < 0$ for $x \in \mathbb{R}$ and $y$ in the range of the response variable, 
and $E\{q_2(m_{n1}(\beta_n), Y_{n1}) A_p(X_1) \mid U = u\}$ is invertible. 

(E) The functions $V''(\cdot)$ and $g'''(\cdot)$ are continuous. The least-favorable curve $\alpha_{\beta_n}(u)$ is 
three times continuously differentiable in $\beta_n$ and $u$. 

(F) The random variable $U$ has a compact support $\Omega$. The density function $f_U(u)$ of $U$ 
has a continuous second derivative and is uniformly bounded away from zero. 

(G) The kernel K is a bounded symmetric density function with bounded support. 

Note that the above conditions are assumed to hold uniformly in $u \in \Omega$. Condition (A) 
is imposed just for the simplicity of proofs. The boundedness of covariates is imposed 
to ensure that various products involving $q_l(\cdot,\cdot)$, $X$ and $Z_n$ have bounded first and second 
moments. Conditions (B) and (C) are uniformity conditions on higher-order moments of 
the likelihood functions. They are stronger than those of the usual asymptotic likelihood 
theory, but they facilitate technical proofs. Condition (G) is also imposed for simplicity 
of technical arguments. All of these conditions can be relaxed at the expense of longer 
proofs. 

Before proving Theorem 1 we need two important lemmas. Lemma 1 concerns the 
order approximations to the least-favorable curve $\alpha_{\beta_n}(\cdot)$, while Lemma 2 holds the key 
to showing why undersmoothing is not needed in Theorems 1 and 2. Let $c_n = (nh)^{-1/2}$, 
let $\hat a_{0\beta_n}, \ldots, \hat a_{p\beta_n}$ maximize (2.4), and let $\hat\alpha'_{\beta_n}(u) = \partial \hat a_{0\beta_n}(u)/\partial \beta_n$. Set 



V i — n / 



T 



Lemma 1 Under Regularity Conditions (A) - (G), for each $\beta_n \in \Omega_n$, the following holds 
uniformly in $u \in \Omega$: 

$$\|\hat a_{0\beta_n}(u) - \alpha_{\beta_n}(u)\| = O_P\bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr).$$

Likewise, the norms of the $k$th derivatives of the above with respect to any $\beta_{nj}$'s, for $k = 
1, \ldots, 4$, all have the same order uniformly in $u \in \Omega$. 

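Proof of Lemma 1. Our first step is to show that, uniformly in $u \in \Omega$, 

$$\hat\beta^* = A_n^{-1} W_n + O_P\bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr),$$

where 

$$A_n = f_U(u)\, E\{\rho_2(\alpha_{\beta_n}(U)^T X + Z_n^T \beta_n)\, A_p(X) \mid U = u\},$$

$$W_n = h c_n \sum_{i=1}^n q_1(\alpha_{ni}, Y_{ni})\, X_i^* K_h(U_i - u),$$

$$\hat A_n = h c_n^2 \sum_{i=1}^n q_2(\alpha_{ni}, Y_{ni})\, X_i^* X_i^{*T} K_h(U_i - u).$$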
Since expression (2.4) is maximized at $(\hat a_{0\beta_n}, \ldots, \hat a_{p\beta_n})^T$, $\hat\beta^*$ maximizes 

$$l_n(\beta^*) = h \sum_{i=1}^n \{Q(g^{-1}(c_n X_i^{*T}\beta^* + \alpha_{ni}), Y_{ni}) - Q(g^{-1}(\alpha_{ni}), Y_{ni})\} K_h(U_i - u)$$

$$= W_n^T \beta^* + \frac{1}{2}\beta^{*T} \hat A_n \beta^* + \frac{h c_n^3}{6} \sum_{i=1}^n q_3(\eta_i, Y_{ni}) (X_i^{*T}\beta^*)^3 K_h(U_i - u),$$

where $\eta_i$ lies between $\alpha_{ni}$ and $\alpha_{ni} + c_n X_i^{*T}\beta^*$. The concavity of $l_n(\beta^*)$ is ensured by 
Condition (D). Since $K(\cdot)$ is bounded, under Conditions (A) and (C) the 
third term on the right hand side is bounded by 

$$O_P\bigl(n h c_n^3\, E|q_3(\eta_1, Y_{n1})\, \|X_1^*\|^3 K_h(U_1 - u)|\bigr) = O_P(c_n) = o_P(1).$$

Direct calculation yields $E\hat A_n = -A_n + O(h^{p+1})$ and $\mathrm{Var}((\hat A_n)_{ij}) = O((nh)^{-1})$, so that 
the mean-variance decomposition yields 

$$\hat A_n = -A_n + O_P(h^{p+1} + c_n).$$

Hence we have 

$$l_n(\beta^*) = W_n^T \beta^* - \frac{1}{2}\beta^{*T} A_n \beta^* + o_P(1). \tag{5.1}$$

Note that $\hat A_n$ is a sum of i.i.d. random variables of kernel form; by a result of [21], 

$$\hat A_n = -A_n + O_P\{h^{p+1} + c_n \log^{1/2}(1/h)\} \tag{5.2}$$

uniformly in $u \in \Omega$. Hence by the Convexity Lemma ([24]), equation (5.1) also holds 
uniformly in $\beta^* \in C$ for any compact set $C$. Using Lemma A.1 of [5], this yields 

$$\sup_{u \in \Omega} |\hat\beta^* - A_n^{-1} W_n| \stackrel{P}{\longrightarrow} 0. \tag{5.3}$$
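Furthermore, by its definition, $\hat\beta^*$ solves the local likelihood equation: 

$$\sum_{i=1}^n q_1(\alpha_{ni} + c_n X_i^{*T}\hat\beta^*, Y_{ni})\, X_i^* K_h(U_i - u) = 0. \tag{5.4}$$

Expanding $q_1(\alpha_{ni} + c_n X_i^{*T}\hat\beta^*, \cdot)$ at $\alpha_{ni}$ yields 

$$W_n + \hat A_n \hat\beta^* + \frac{h c_n^3}{2} \sum_{i=1}^n q_3(\alpha_{ni} + \hat\zeta_i, Y_{ni})\, X_i^* (X_i^{*T}\hat\beta^*)^2 K_h(U_i - u) = 0, \tag{5.5}$$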
where $\hat\zeta_i$ lies between $0$ and $c_n X_i^{*T}\hat\beta^*$. Using Conditions (A) and (C), the last term has 
order $O_P(c_n^3 h n \|\hat\beta^*\|^2) = O_P(c_n \|\hat\beta^*\|^2)$. With this, combining (5.2) and (5.5), we obtain that 

$$\hat\beta^* = A_n^{-1} W_n + O_P\bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr) \tag{5.6}$$

holds uniformly in $u \in \Omega$ by (5.3). Using the result of [21] on $W_n$, we obtain 

$$\|\hat a_{0\beta_n}(u) - \alpha_{\beta_n}(u)\| = O_P\bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr), \tag{5.7}$$

which holds uniformly in $u \in \Omega$. 

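Differentiating both sides of (5.4) w.r.t. $\beta_{nj}$ gives 

$$\sum_{i=1}^n q_2(\alpha_{ni} + c_n X_i^{*T}\hat\beta^*, Y_{ni}) \left(\frac{\partial \alpha_{ni}}{\partial \beta_{nj}} + c_n X_i^{*T} \frac{\partial \hat\beta^*}{\partial \beta_{nj}}\right) X_i^* K_h(U_i - u) = 0, \tag{5.8}$$

which holds for all $u \in \Omega$. By Taylor's expansion and similar treatments to (5.5), 

$$W_1 + W_2 + (\hat A_n + B_1 + B_2) \frac{\partial \hat\beta^*}{\partial \beta_{nj}} + O_P(c_n \|\hat\beta^*\|^2) = 0,$$

where 

$$W_1 = h c_n \sum_{i=1}^n q_2(\alpha_{ni}, Y_{ni}) \frac{\partial \alpha_{ni}}{\partial \beta_{nj}} X_i^* K_h(U_i - u),$$

$$W_2 = h c_n \sum_{i=1}^n q_3(\alpha_{ni}, Y_{ni})\, c_n X_i^{*T}\hat\beta^* \frac{\partial \alpha_{ni}}{\partial \beta_{nj}} X_i^* K_h(U_i - u),$$

$$B_1 = h c_n^2 \sum_{i=1}^n q_3(\alpha_{ni}, Y_{ni})\, c_n X_i^{*T}\hat\beta^*\, X_i^* X_i^{*T} K_h(U_i - u),$$

$$B_2 = \frac{h c_n^2}{2} \sum_{i=1}^n q_4(\alpha_{ni} + \hat\zeta_i, Y_{ni}) (c_n X_i^{*T}\hat\beta^*)^2 X_i^* X_i^{*T} K_h(U_i - u),$$

with $\hat\zeta_i$ lying between $0$ and $c_n X_i^{*T}\hat\beta^*$. The above equations hold for all $u \in \Omega$. The order 
of $W_2$ is smaller than that of $W_1$, and the orders of $B_1$ and $B_2$ are smaller than that of 
$\hat A_n$. Hence 

$$\frac{\partial \hat\beta^*}{\partial \beta_{nj}} = A_n^{-1} W_1 + o_P\bigl(\log^{1/2}(1/h) + c_n^{-1} h^{p+1}\bigr)$$

uniformly in $u \in \Omega$. From this, for $j = 1, \ldots, p_n$, we have 

$$\left\|\frac{\partial \hat a_{0\beta_n}(u)}{\partial \beta_{nj}} - \frac{\partial \alpha_{\beta_n}(u)}{\partial \beta_{nj}}\right\| = O_P\bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr) \tag{5.9}$$

uniformly in $u \in \Omega$. Differentiating (5.4) again w.r.t. $\beta_{nk}$ and repeating as needed, we get 
the desired results for higher order derivatives by following similar arguments as above. 
□ 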
Lemma 2 Under Regularity Conditions (A) - (G), if $p_n^s/n \to 0$ for $s > 5/4$ and $h = O(n^{-a})$ 
with $(2s(p+1))^{-1} < a < 1 - s^{-1}$, then for each $\beta_n \in \Omega_n$, 

$$n^{-1/2} \|\nabla \hat Q_n(\beta_n) - \nabla Q_n(\beta_n)\| = o_P(1).$$

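Proof of Lemma 2. Define 

$$K_1 = n^{-1/2} \sum_{i=1}^n q_2(m_{ni}(\beta_n), Y_{ni}) (Z_{ni} + \alpha'_{\beta_n}(U_i) X_i) (\hat\alpha_{\beta_n}(U_i) - \alpha_{\beta_n}(U_i))^T X_i,$$

$$K_2 = n^{-1/2} \sum_{i=1}^n q_1(m_{ni}(\beta_n), Y_{ni}) (\hat\alpha'_{\beta_n}(U_i) - \alpha'_{\beta_n}(U_i)) X_i.$$

Then by Taylor's expansion, Lemma 1 and Condition (C), 

$$n^{-1/2} (\nabla \hat Q_n(\beta_n) - \nabla Q_n(\beta_n)) = K_1 + K_2 + \text{smaller order terms}.$$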

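Define, for $\Omega$ as in Condition (F), 

$$S = \{f \in C^2(\Omega) : \|f\|_\infty \le 1\},$$

equipped with a metric $\rho(f_1, f_2) = \|f_1 - f_2\|_\infty$, where $\|f\|_\infty = \sup_{u \in \Omega} |f(u)|$. We also let, 
for $r = 1, \ldots, q$ and $l = 1, \ldots, p_n$, 

$$A_{rl}(y, u, X, Z_n) = q_2(X^T \alpha_{\beta_n}(u) + Z_n^T \beta_n, y)\, X_r \left(Z_{nl} + X^T \frac{\partial \alpha_{\beta_n}(u)}{\partial \beta_{nl}}\right),$$

$$B_r(y, u, X, Z_n) = q_1(X^T \alpha_{\beta_n}(u) + Z_n^T \beta_n, y)\, X_r.$$

By Lemma 1, for any positive sequence $(\delta_n)$ with $\delta_n \to 0$ as $n \to \infty$, we have 
$P_0(\hat\lambda_r \in S) \to 1$ and $P_0(\hat\gamma_{rl} \in S) \to 1$, where 

$$\hat\lambda_r = \delta_n \bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr)^{-1} \bigl(\hat\alpha^{(r)}_{\beta_n} - \alpha^{(r)}_{\beta_n}\bigr),$$

$$\hat\gamma_{rl} = \delta_n \bigl(h^{p+1} + c_n \log^{1/2}(1/h)\bigr)^{-1} \left(\frac{\partial \hat\alpha^{(r)}_{\beta_n}}{\partial \beta_{nl}} - \frac{\partial \alpha^{(r)}_{\beta_n}}{\partial \beta_{nl}}\right),$$

$r = 1, \ldots, q$ and $l = 1, \ldots, p_n$. Hence for sufficiently large $n$, we have $\hat\lambda_r, \hat\gamma_{rl} \in S$. The 
following three points allow us to utilize [15] to prove our lemma. 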
following three points allow us to utilize [15] to prove our lemma. 

I. For any $v \in S$, we will view the map $v \mapsto A_{rl}(y, u, X, Z_n) v(u)$ as an element of 
$C(S)$, the space of continuous functions on $S$ equipped with the sup norm. For 
$v_1, v_2 \in S$, we have 

$$|A_{rl}(y, u, X, Z_n) v_1(u) - A_{rl}(y, u, X, Z_n) v_2(u)| = |A_{rl}(y, u, X, Z_n)(v_1 - v_2)(u)| \le |A_{rl}(y, u, X, Z_n)|\, \|v_1 - v_2\|_\infty.$$

A similar result holds for $B_r(y, u, X, Z_n)$. 
II. Note that equation (2.2) is true for all $\beta_n$, and by differentiating w.r.t. $\beta_n$ we get 
the following formulas: 

$$E\{q_1(m_n(\beta_n), Y_n) X \mid U = u\} = 0,$$
$$E\{q_2(m_n(\beta_n), Y_n) X (Z_n + \alpha'_{\beta_n}(U) X)^T \mid U = u\} = 0.$$

Thus, we can easily see that 

$$E_0(A_{rl}(Y, U, X, Z_n)) = 0$$

for each $r = 1, \ldots, q$ and $l = 1, \ldots, p_n$. Also we have 

$$E(A_{rl}(Y, U, X, Z_n)^2) < \infty$$

by Regularity Conditions (A) and (C). For $B_r(Y, U, X, Z_n)$, the same results hold. 

III. Let H(-,S) denote the metric entropy of the set S w.r.t. the metric p. Then 

H(e, S)<C e^ 

for some constant C . Hence if 1//2 (e, S)de < oo. 

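The conditions of Theorem 1 in [15] can be derived from the three notes above, so that 

$$n^{-1/2} \sum_{i=1}^n A_{rl}(Y_i, U_i, X_i, Z_{ni})(\cdot),$$

where $A_{rl}(Y_i, U_i, X_i, Z_{ni})(\cdot)$, $i = 1, \ldots, n$, are i.i.d. replicates of $A_{rl}(Y, U, X, Z_n)(\cdot)$ in 
$C(S)$, converges weakly to a Gaussian measure on $C(S)$. Hence, since $\hat\lambda_r, \hat\gamma_{rl} \in S$, 

$$n^{-1/2} \sum_{i=1}^n A_{rl}(Y_i, U_i, X_i, Z_{ni})(\hat\lambda_r) = O_P(1),$$

which implies that 

$$n^{-1/2} \sum_{i=1}^n A_{rl}(Y_i, U_i, X_i, Z_{ni})\bigl(\hat\alpha^{(r)}_{\beta_n} - \alpha^{(r)}_{\beta_n}\bigr) = O_P\bigl(\delta_n^{-1}(h^{p+1} + c_n \log^{1/2}(1/h))\bigr).$$

Similarly, applying Theorem 1 of [15] again, we have 

$$n^{-1/2} \sum_{i=1}^n B_r(Y_i, U_i, X_i, Z_{ni}) \left(\frac{\partial \hat\alpha^{(r)}_{\beta_n}}{\partial \beta_{nl}} - \frac{\partial \alpha^{(r)}_{\beta_n}}{\partial \beta_{nl}}\right) = O_P\bigl(\delta_n^{-1}(h^{p+1} + c_n \log^{1/2}(1/h))\bigr).$$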
Then the column vector $K_1$, which is $p_n$-dimensional, has $l$th component equal to 

$$\sum_{r=1}^q \left\{ n^{-1/2} \sum_{i=1}^n A_{rl}(Y_i, U_i, X_i, Z_{ni})\bigl(\hat\alpha^{(r)}_{\beta_n} - \alpha^{(r)}_{\beta_n}\bigr) \right\} = O_P\bigl(\delta_n^{-1}(h^{p+1} + c_n \log^{1/2}(1/h))\bigr),$$

using the result just proved. Hence we have shown 

$$\|K_1\| = O_P\bigl(\sqrt{p_n}\, \delta_n^{-1}(h^{p+1} + c_n \log^{1/2}(1/h))\bigr) = o_P(1),$$

since $\delta_n$ can be made arbitrarily slow in converging to 0. Similarly, we have $\|K_2\| = o_P(1)$ 
as well. The conclusion of the lemma follows. □ 
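
To see how the bandwidth range in Lemma 2 delivers the last step, the rates can be checked directly. The calculation below is a sketch under the additional assumption that $h$ is of exact order $n^{-a}$ (the lemma itself states $h = O(n^{-a})$); it uses only $p_n = o(n^{1/s})$ and $c_n = (nh)^{-1/2}$.

```latex
\begin{align*}
\sqrt{p_n}\, h^{p+1}
  &= o\bigl(n^{1/(2s)}\bigr)\, O\bigl(n^{-a(p+1)}\bigr) = o(1)
  && \text{since } a > \{2s(p+1)\}^{-1}, \\
\sqrt{p_n}\, c_n \log^{1/2}(1/h)
  &= o\bigl(n^{\{1/s - (1 - a)\}/2}\bigr)\, O\bigl(\log^{1/2} n\bigr) = o(1)
  && \text{since } a < 1 - s^{-1}.
\end{align*}
```

Both terms vanish at a polynomial rate, which leaves room to let $\delta_n \to 0$ slowly enough that multiplying by $\delta_n^{-1}$ does not change the conclusion.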

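Proof of Theorem 1. 

Let $\gamma_n = \sqrt{p_n/n}$. Our aim is to show that, for a given $\epsilon > 0$, 

$$P\Bigl\{ \sup_{\|v\| = C} \hat Q_n(\beta_{n0} + \gamma_n v) < \hat Q_n(\beta_{n0}) \Bigr\} \ge 1 - \epsilon, \tag{5.10}$$

which implies that with probability tending to 1 there is a local maximum $\hat\beta_n$ in the ball 
$\{\beta_{n0} + \gamma_n v : \|v\| \le C\}$ such that $\|\hat\beta_n - \beta_{n0}\| = O_P(\gamma_n)$. 

Define the terms $\hat I_1 = \gamma_n \nabla^T \hat Q_n(\beta_{n0}) v$, $\hat I_2 = \frac{1}{2}\gamma_n^2 v^T \nabla^2 \hat Q_n(\beta_{n0}) v$ and 
$\hat I_3 = \frac{1}{6}\gamma_n^3 \nabla^T (v^T \nabla^2 \hat Q_n(\beta_n^*) v) v$. By Taylor's expansion, 

$$\hat Q_n(\beta_{n0} + \gamma_n v) - \hat Q_n(\beta_{n0}) = \hat I_1 + \hat I_2 + \hat I_3,$$

where $\beta_n^*$ lies between $\beta_{n0}$ and $\beta_{n0} + \gamma_n v$. 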
We further split I x = D x + D 2 , where 

n 

Di = ^2q 1 (m ni ((3 nQ ),Y ni )(Z ni + a^ no (t/ i )X i ) T v7 ri , 

n 
i=l 

with m n i(f3 n ) = ctf3 n (Ui) T Xi + /3^Z n j. By Condition (A) and Lemma [H D2 has order 
smaller than Z^. Using Taylor's expansion, we have 

D\ = 7 n v T ( ^™(^"o) + \/nKi J + smaller order terms, 

where Kx is as defined in Lemma [2] so that within the lemma's proof we have ||Ki|| = 
op(l). Using equation (12. 61) . we have by the mean-variance decomposition 

v k w» 



O P (Jnv T I n (f3 n0 )v) =Op(V^) || v| 
where the last equality follows from Condition (B). Hence 

$$|\hat I_1| = O_P(\sqrt{n}\, \gamma_n) \|v\|.$$

Next, consider $\hat I_2 = I_2 + (\hat I_2 - I_2)$, where 

$$I_2 = \frac{1}{2} v^T \nabla^2 Q_n(\beta_{n0}) v\, \gamma_n^2 = -\frac{n}{2} v^T I_n(\beta_{n0}) v\, \gamma_n^2 + \frac{n}{2} v^T \{n^{-1} \nabla^2 Q_n(\beta_{n0}) + I_n(\beta_{n0})\} v\, \gamma_n^2$$

$$= -\frac{n}{2} v^T I_n(\beta_{n0}) v\, \gamma_n^2 + o_P(n \gamma_n^2) \|v\|^2,$$

with the last line following from Lemma 5 in the Appendix. Using Lemma 4, 

$$\|\hat I_2 - I_2\| = o_P(n \gamma_n^2 \|v\|^2).$$

On the other hand, by Condition (B), we have 

$$|n \gamma_n^2 v^T I_n(\beta_{n0}) v| \ge O(n \gamma_n^2 \lambda_{\min}(I_n(\beta_{n0})) \|v\|^2) = O(n \gamma_n^2 \|v\|^2).$$

Hence, $\hat I_2 - I_2$ has a smaller order than $I_2$. 

Finally consider I3. We suppress the dependence of ctp (Ui) and its derivatives on 
Ui, and denote q u = q 1 (m ni (f3 n0 ),Y ni ). Using Taylor's expansions, expanding Q n (/3*) at 
f3 n0 and then Q n (f3 n0 ) at ctp no , we can arrive at 

n 



Qn(P* n ) = Qn(Pno) + X^liXf(«/3„ " a Pj 

i=l 

+ q lt (Z ni + a^ no X i ) T (/3; - (3 n0 )}(l + o P (l)). 

Substituting Q n {f3* n ) into I3 with the right hand side above, by Condition (C) and Lemma 
[TJ we have 

I3 = \ Y] a a a^a°2 v i v 3 v kll + smaller order terms. 
6 d(3 ni d(3 nj df3 nk 

Hence, 

$$|\hat I_3| = O_P(n p_n^{3/2} \gamma_n^3 \|v\|^3) = O_P\bigl(\sqrt{p_n^4/n}\, \|v\|\bigr)\, n \gamma_n^2 \|v\|^2 = o_P(1)\, n \gamma_n^2 \|v\|^2.$$

Comparing, we find that the order of $-n \gamma_n^2 v^T I_n(\beta_{n0}) v$ dominates all other terms by allowing 
$\|v\| = C$ to be large enough. This proves (5.10). □ 
Proof of Theorem 2. 

Note that by Theorem 1, $\|\hat\beta_n - \beta_{n0}\| = O_P(\sqrt{p_n/n})$. Since $\nabla \hat Q_n(\hat\beta_n) = 0$, by Taylor's 
expansion, 

$$\nabla \hat Q_n(\beta_{n0}) + \nabla^2 \hat Q_n(\beta_{n0})(\hat\beta_n - \beta_{n0}) + C = 0, \tag{5.11}$$

where $\beta_n^*$ lies between $\beta_{n0}$ and $\hat\beta_n$, and $C = \frac{1}{2}(\hat\beta_n - \beta_{n0})^T \nabla^2 (\nabla \hat Q_n(\beta_n^*)) (\hat\beta_n - \beta_{n0})$, which is 
understood as a vector of quadratic components. 

Using similar argument to approximating J3 in Theorem [T], by Lemma [T] and noting 



\Pn - /'noil = We haVe H V 



2 9Qn(/3;) ||2 



d/3 n 



Op(n 2 pl). Hence 



<9A 



P (^,/n 2 ) = op^' 1 ). (5.12) 



At the same time, by Lemma 5 and the Cauchy-Schwarz inequality, 

$$\|n^{-1} \nabla^2 \hat Q_n(\beta_{n0})(\hat\beta_n - \beta_{n0}) + I_n(\beta_{n0})(\hat\beta_n - \beta_{n0})\| = o_P((n p_n)^{-1/2}) + O_P\bigl(\sqrt{p_n^3/n}\,(h^{p+1} + c_n \log^{1/2}(1/h))\bigr) = o_P(n^{-1/2}). \tag{5.13}$$

Combining (5.11), (5.12) and (5.13), we have 

$$I_n(\beta_{n0})(\hat\beta_n - \beta_{n0}) = n^{-1} \nabla \hat Q_n(\beta_{n0}) + o_P(n^{-1/2}) = n^{-1} \nabla Q_n(\beta_{n0}) + o_P(n^{-1/2}), \tag{5.14}$$

where the last equality follows from Lemma 2. Consequently, using equation (5.14), we get 

$$\sqrt{n}\, A_n I_n^{1/2}(\beta_{n0})(\hat\beta_n - \beta_{n0}) = n^{-1/2} A_n I_n^{-1/2}(\beta_{n0}) \nabla Q_n(\beta_{n0}) + o_P(A_n I_n^{-1/2}(\beta_{n0})) \tag{5.15}$$
$$= n^{-1/2} A_n I_n^{-1/2}(\beta_{n0}) \nabla Q_n(\beta_{n0}) + o_P(1),$$

since $\|A_n I_n^{-1/2}(\beta_{n0})\| = O(1)$ by the conditions of Theorem 2. 

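We now check the conditions of the Lindeberg-Feller Central Limit Theorem (see for example, [29]) for 
the last term in (5.15). Let $B_{ni} = n^{-1/2} A_n I_n^{-1/2}(\beta_{n0}) \nabla Q_{ni}(\beta_{n0})$, $i = 1, \ldots, n$. Given 
$\epsilon > 0$, 

$$\sum_{i=1}^n E \|B_{ni}\|^2 1\{\|B_{ni}\| > \epsilon\} \le n \{E \|B_{n1}\|^4 \cdot P(\|B_{n1}\| > \epsilon)\}^{1/2}.$$

Using Chebyshev's inequality, 

$$P(\|B_{n1}\| > \epsilon) \le n^{-1} \epsilon^{-2} E \|A_n I_n^{-1/2}(\beta_{n0}) \nabla Q_{n1}(\beta_{n0})\|^2 = n^{-1} \epsilon^{-2}\, \mathrm{tr}(A_n A_n^T) = O(n^{-1}), \tag{5.16}$$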
where $\mathrm{tr}(A)$ is the trace of a square matrix $A$. Similarly, we can show that, using 
Condition (B), 

$$E \|B_{n1}\|^4 \le n^{-2} \lambda_{\max}^2(A_n A_n^T)\, \lambda_{\min}^{-2}(I_n(\beta_{n0})) \sqrt{E (\nabla^T Q_{n1}(\beta_{n0}) \nabla Q_{n1}(\beta_{n0}))^4} = O(p_n^2/n^2). \tag{5.17}$$

Therefore (5.16) and (5.17) together imply 

$$\sum_{i=1}^n E \|B_{ni}\|^2 1\{\|B_{ni}\| > \epsilon\} = O\bigl(\sqrt{p_n^2/n}\bigr) = o(1).$$

Also, 

$$\sum_{i=1}^n \mathrm{Var}(B_{ni}) = \mathrm{Var}(A_n I_n^{-1/2}(\beta_{n0}) \nabla Q_{n1}(\beta_{n0})) = A_n A_n^T \to G.$$

Therefore $B_{ni}$ satisfies the conditions of the Lindeberg-Feller Central Limit Theorem. 
Consequently, using (5.15), it follows that 

$$\sqrt{n}\, A_n I_n^{1/2}(\beta_{n0})(\hat\beta_n - \beta_{n0}) \stackrel{D}{\longrightarrow} N(0, G),$$

and this completes the proof. □ 

Referring back to Section 2.2, let $B_n$ be a $(p_n - l) \times p_n$ matrix satisfying $B_n B_n^T = I_{p_n - l}$ 
and $A_n B_n^T = 0$. Since $A_n \beta_n = 0$ under $H_0$, rows of $A_n$ are perpendicular to $\beta_n$ and the 
orthogonal complement of rows of $A_n$ is spanned by rows of $B_n$ since $A_n B_n^T = 0$. Hence 

$$\beta_n = B_n^T \gamma$$

under $H_0$, where $\gamma$ is a $(p_n - l) \times 1$ vector. Then under $H_0$ the profile likelihood estimator 
is also the local maximizer $\hat\gamma_n$ of the problem 

$$\hat Q_n(B_n^T \hat\gamma_n) = \max_{\gamma} \hat Q_n(B_n^T \gamma).$$

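Proof of Theorem 3. 

By Taylor's expansion, expanding $\hat Q_n(B_n^T \hat\gamma_n)$ at $\hat\beta_n$ and noting that $\nabla^T \hat Q_n(\hat\beta_n) = 0$, 
we have $\hat Q_n(\hat\beta_n) - \hat Q_n(B_n^T \hat\gamma_n) = T_1 + T_2$, where 

$$T_1 = -\frac{1}{2} (\hat\beta_n - B_n^T \hat\gamma_n)^T \nabla^2 \hat Q_n(\hat\beta_n) (\hat\beta_n - B_n^T \hat\gamma_n),$$

$$T_2 = \frac{1}{6} \nabla^T \{(\hat\beta_n - B_n^T \hat\gamma_n)^T \nabla^2 \hat Q_n(\beta_n^*) (\hat\beta_n - B_n^T \hat\gamma_n)\} (\hat\beta_n - B_n^T \hat\gamma_n).$$

Denote $\Theta_n = I_n(\beta_{n0})$ and $\Phi_n = \frac{1}{n} \nabla Q_n(\beta_{n0})$. Using equation (5.14) and noting that 
$\Theta_n$ has eigenvalues uniformly bounded away from 0 and infinity by Condition (B), we 
have 

$$\hat\beta_n - \beta_{n0} = \Theta_n^{-1} \Phi_n + o_P(n^{-1/2}).$$

Combining this with Lemma 6 in the Appendix, under the null hypothesis $H_0$, 

$$\hat\beta_n - B_n^T \hat\gamma_n = \Theta_n^{-1/2} \{I_{p_n} - \Theta_n^{1/2} B_n^T (B_n \Theta_n B_n^T)^{-1} B_n \Theta_n^{1/2}\} \Theta_n^{-1/2} \Phi_n + o_P(n^{-1/2}). \tag{5.18}$$

Since $S_n = I_{p_n} - \Theta_n^{1/2} B_n^T (B_n \Theta_n B_n^T)^{-1} B_n \Theta_n^{1/2}$ is a $p_n \times p_n$ idempotent matrix with rank 
$l$, it follows by the mean-variance decomposition of the term $\|\hat\beta_n - B_n^T \hat\gamma_n\|^2$ and Condition 
(B) that 

$$\|\hat\beta_n - B_n^T \hat\gamma_n\| = O_P(n^{-1/2}).$$

Hence, using an argument similar to the approximation of the order of $|\hat I_3|$ in Theorem 1, we 
have 

$$|T_2| = O_P(n p_n^{3/2}) \cdot \|\hat\beta_n - B_n^T \hat\gamma_n\|^3 = o_P(1).$$

Hence $\hat Q_n(\hat\beta_n) - \hat Q_n(B_n^T \hat\gamma_n) = T_1 + o_P(1)$. 

By Lemma 5 and the approximation $n^{-1} \|\nabla^2 \hat Q_n(\hat\beta_n) - \nabla^2 \hat Q_n(\beta_{n0})\| = o_P(p_n^{-1/2})$ 
(the proof is similar to that of Lemma 3 together with the proof of the order of $|\hat I_3|$ in Theorem 1, 
and is omitted), we have 

$$\frac{1}{2} (\hat\beta_n - B_n^T \hat\gamma_n)^T \{\nabla^2 \hat Q_n(\hat\beta_n) + n I_n(\beta_{n0})\} (\hat\beta_n - B_n^T \hat\gamma_n) = O_P(1/n) \cdot n \{o_P(p_n^{-1/2}) + O_P(p_n (h^{p+1} + c_n \log^{1/2}(1/h)))\} = o_P(1).$$

Therefore, 

$$\hat Q_n(\hat\beta_n) - \hat Q_n(B_n^T \hat\gamma_n) = \frac{n}{2} (\hat\beta_n - B_n^T \hat\gamma_n)^T I_n(\beta_{n0}) (\hat\beta_n - B_n^T \hat\gamma_n) + o_P(1).$$

By (5.18), we have 

$$\hat Q_n(\hat\beta_n) - \hat Q_n(B_n^T \hat\gamma_n) = \frac{n}{2} \Phi_n^T \Theta_n^{-1/2} S_n \Theta_n^{-1/2} \Phi_n + o_P(1).$$

Since $S_n$ is idempotent, it can be written as $S_n = D_n^T D_n$, where $D_n$ is an $l \times p_n$ matrix 
satisfying $D_n D_n^T = I_l$. By Theorem 2, we have already shown that $\sqrt{n}\, D_n \Theta_n^{-1/2} \Phi_n \stackrel{D}{\longrightarrow} 
N(0, I_l)$. Hence 

$$2\{\hat Q_n(\hat\beta_n) - \hat Q_n(B_n^T \hat\gamma_n)\} = n (D_n \Theta_n^{-1/2} \Phi_n)^T (D_n \Theta_n^{-1/2} \Phi_n) + o_P(1) \stackrel{D}{\longrightarrow} \chi^2_l. \qquad \Box$$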
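Proof of Theorem 4. 

Let $\hat A_n = -n^{-1} \nabla^2 \hat Q_n(\hat\beta_n)$, $\hat B_n = \widehat{\mathrm{cov}}\{\nabla \hat Q_n(\hat\beta_n)\}$ and $C = I_n(\beta_{n0})$. Write 

$$I_1 = \hat A_n^{-1} (\hat B_n - C) \hat A_n^{-1}, \quad I_2 = \hat A_n^{-1} (C - \hat A_n) \hat A_n^{-1}, \quad I_3 = \hat A_n^{-1} (C - \hat A_n) C^{-1}.$$

Then $\hat\Sigma_n - \Sigma_n = I_1 + I_2 + I_3$. Our aim is to show that, for all $i = 1, \ldots, p_n$, 

$$\lambda_i(\hat\Sigma_n - \Sigma_n) = o_P(1),$$

so that $A_n (\hat\Sigma_n - \Sigma_n) A_n^T \stackrel{P}{\longrightarrow} 0$, where $\lambda_i(A)$ is the $i$th eigenvalue of a symmetric matrix 
$A$. Using the inequalities 

$$\lambda_{\min}(I_1) + \lambda_{\min}(I_2) + \lambda_{\min}(I_3) \le \lambda_{\min}(I_1 + I_2 + I_3),$$
$$\lambda_{\max}(I_1 + I_2 + I_3) \le \lambda_{\max}(I_1) + \lambda_{\max}(I_2) + \lambda_{\max}(I_3),$$

it suffices to show that $\lambda_i(I_j) = o_P(1)$ for $j = 1, 2, 3$. From the definition of $I_1$, $I_2$ and 
$I_3$, it is clear that we only need to show $\lambda_i(C - \hat A_n) = o_P(1)$ and $\lambda_i(\hat B_n - C) = o_P(1)$. 
Let $K_1 = I_n(\beta_{n0}) + n^{-1} \nabla^2 Q_n(\beta_{n0})$, $K_2 = n^{-1} (\nabla^2 Q_n(\hat\beta_n) - \nabla^2 Q_n(\beta_{n0}))$, and $K_3 = 
n^{-1} (\nabla^2 \hat Q_n(\hat\beta_n) - \nabla^2 Q_n(\hat\beta_n))$. Then, 

$$C - \hat A_n = K_1 + K_2 + K_3.$$

Applying Lemma 5 to $K_1$, Lemma 3 to $K_2$, and Lemma 4 to $K_3$, we have $\|C - \hat A_n\| = o_P(1)$. 
Thus, $\lambda_i(C - \hat A_n) = o_P(1)$. Hence the only thing left to show is $\lambda_i(\hat B_n - C) = o_P(1)$. 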
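
The three-term decomposition used at the start of this proof can be verified by direct algebra. The check below is a sketch that assumes the sandwich estimator has the form $\hat\Sigma_n = \hat A_n^{-1} \hat B_n \hat A_n^{-1}$ and that its target is $\Sigma_n = C^{-1} = I_n^{-1}(\beta_{n0})$; both forms come from the earlier definition of the sandwich formula, which is not restated in this section.

```latex
\begin{align*}
I_1 + I_2 + I_3
 &= \hat A_n^{-1}(\hat B_n - C)\hat A_n^{-1}
  + \hat A_n^{-1}(C - \hat A_n)\hat A_n^{-1}
  + \hat A_n^{-1}(C - \hat A_n)C^{-1} \\
 &= \bigl(\hat A_n^{-1}\hat B_n\hat A_n^{-1} - \hat A_n^{-1}C\hat A_n^{-1}\bigr)
  + \bigl(\hat A_n^{-1}C\hat A_n^{-1} - \hat A_n^{-1}\bigr)
  + \bigl(\hat A_n^{-1} - C^{-1}\bigr) \\
 &= \hat A_n^{-1}\hat B_n\hat A_n^{-1} - C^{-1}.
\end{align*}
```

The middle terms telescope, so controlling $C - \hat A_n$ and $\hat B_n - C$ is enough to control the whole difference.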
To this end, consider the decomposition 

$$\hat B_n - C = K_4 + K_5,$$

where 

_ fl " dQrM 9Qm0n) \ J(R x 

Ka ~ [n^-dfc w~r)~ " n(/3no) ' 

~ f 1 A dQ ni n ) ) f 1 9 Qni(Pr, 

Our goal is to show that and are op(l), which then implies Xi(B n — C) = op(l). 
We consider K± first, which can be further decomposed into K4 = K e + K 7 , where 

1 -A dQni0 n ) dQ ni n ) 1 -A dQ ni ((3 n0 ) dQ ni (f3 T 



\ n ^ d(5 nj dp nk n ^ <9/? ni d(3 nk 
v / 1 A dQ ni (P n0 ) dQ ni ((3 n0 ) \ . , 



1=1 
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Observe that 



1 A dQ ni ((3 n0 ) / dQ ni ((3 n ) dQ ni (/3. 



n 



0$ 



nk 



dp 



nk 



n <rrf dp nk 

1 = 1 



l^(dQ ni ((3 n ) dQ ni ((3 n0 )\ (dQ ni (j3 n ) dQ ni ((3 n0 ) 



-E 



dP 



it k 



dp 



nk 



dp, 



dp 



and this suggests that an approximation of the order of ai—(Qni(Pn) ~ Qm{P n n)) f° r 
each k — 1, • ■ • ,p n and i — 1, ■ ■ • , n is rewarding. Define 

Oifc = 777^— {Qni(Pn) ~ Qni(P n )), and = ^—(Qni(P n ) ~ Qni(Pno))' 



d(3 nk 

of ctp n (Ui) and its derivatives on XJ, { 



dp 



it k 



then -Q^(Qni(f3 n ) — Qni(Pno)) = a ik+bik- By Taylor's expansion, suppressing dependence 



O-ik 



9 2 Qm(P n ) 



(l + o F (l)). 



df3 nk doL T p » ' da T \dp nk df3 nk 

Using Lemma [H Condition (C), with argument similar to the proof of Lemma HI we then 
have 

a lk = P (hv +1 + c n \og l l 2 {l/h)). 
Similarly, Taylor's expansion gives 

d 2 Qni{Pno) 



bik 



-(/3 n -/3 n0 )(l + Op (l)), 



which implies that, by Theorem [1] and Regularity Condition (C), 

\hk\ = P {^/p 2 Jn). 
Using the approximations of and b{ k above, by Condition (C), 
1 ^ dQ ni ((3 n0 ) (dQ ni (f3 n ) dQ ni (f3 n0 ) 



-E 



1 dpnj 



dP 



it k 



dp, 



nk 



n 

< 1 E 

i=l 



dQni(Pno) 



dp, 



\a>ik + hk\ 



P (h p+1 + c n \og 1/2 (l/h) + n x ' 2 p n ). 



32 



This shows that 



\\K 6 \\ =0 P (p n (h^ 1 + c n log 1 / 2 (l/h))+pln' 1 / 2 ) = o P (l) 

by the conditions of the theorem. 
For K-?, note that 

jp v -2( 2 \ TP f d Qni(Pno) dQni(Pno) rp ( dQni(f3 n0 ) dQ ni (P nQ ) 

&qK 7 = n {np n )h < — — — £/ — ^- ^- 

= 0( P 2 Jn) 

which implies that ||(-f^7)|| = Op(p 2 l /n) = o(l). Hence using K± = K e + K7, 
\\K 4 \\ = o P (l) + P (p n (h? +l + c n log^l/fc)) + ^pj/n) = o P (l). 

Finally consider K 5 . Define Aj = n' 1 Yl^ii^j + %) + 77,-1 Y^7=i dQri Q^ 3n0 \ where Oj 
and bij are defined as before, we can then rewrite K 5 = {AjA^}. Now 



\Aj\ < sup \a,ij + bij \ + 



T7 



9Qni{PnQ 



n ^ d(3 n 



i=i 

P {h p+1 + c n log 1/2 (l//i) + n-^V) + Opin- 1 / 2 ), 



where the last line follows from the approximations for and bij, and mean- variance 
decomposition of the term n _1 Y^i=i ^"^ • Hence 

ll^sll =Op(p n (/ i p+1 + c n log 1/2 (l//i)+n- 1 /V) 2 ) = op(l), 
and this completes the proof. □ 

Proof of Theorem HJ 

In expression (2.4), we set $p = 0$, which effectively assumes $\alpha_{\beta_n}(U_i) \approx \alpha_{\beta_n}(u)$ for $U_i$ 
in a neighborhood of $u$. Using the same notation as in the proof of Lemma 1, we have 
$\alpha_{ni}(u) = \alpha_{\beta_n}(u)^T X_i + Z_{ni}^T \beta_n$, $\hat\beta^* = c_n^{-1} (\hat a_{0\beta_n}(u) - \alpha_{\beta_n}(u))$ and $X_i^* = X_i$. Following the 
proof of Lemma 1, we arrive at equation (5.8), which in this case is reduced to 

$$\sum_{i=1}^n q_2(X_i^T \hat a_{0\beta_n} + Z_{ni}^T \beta_n, Y_{ni}) \left(Z_{nij} + \left(\frac{\partial \hat a_{0\beta_n}}{\partial \beta_{nj}}\right)^T X_i\right) X_i K_h(U_i - u) = 0.$$

Solving for $\partial \hat a_{0\beta_n}/\partial \beta_{nj}$ from the above equation, which is true for $j = 1, \ldots, p_n$, we get the 
same expression as given in the lemma. 

Hence it remains to show that $\partial \hat a_{0\beta_n}(u)/\partial \beta_n$ is a consistent estimator of $\alpha'_{\beta_n}(u)$. However, 
this is done by the proof of Lemma 1 already, where equation (5.9) becomes 

$$\left\|\frac{\partial \hat a_{0\beta_n}(u)}{\partial \beta_n} - \alpha'_{\beta_n}(u)\right\| = O_P\bigl(\sqrt{p_n}\,(h + c_n \log^{1/2}(1/h))\bigr) = o_P(1),$$

and the proof is complete. □ 

APPENDIX: PROOFS OF LEMMAS 3 - 6 

Lemma 3 Assuming Conditions (A) - (G) and $p_n^4/n = o(1)$, we have 

$$n^{-1} \|\nabla^2 Q_n(\hat\beta_n) - \nabla^2 Q_n(\beta_{n0})\| = o_P(1).$$
Proof of Lemma Consider 



n-^Qnfpj - V 2 Q n (/3 nC 



1 Pn 

n 2 ^— ' 



d 2 Q n n ) d 2 Q n (f3 n[ 



Pn / P 



f'i / fit 

n 2 ^ ( E 



< 



- Pn Pn 

EE 



dfinidfinjdfink 

d 3 Q n ([3*) x2 

1 ZT^ \ df3nidflnjd0nk 



(fink — Pok) ^ 

\Pnk — Pok\\ 2 , 



where (3* lies between (3 n and (3 n0 . Similar to approximating the order of I3 in the proof 
of Theorem [TJ the last line of the above equation is less than or equal to 

n- 2 O p {n 2 vl)\\K ~ Pnof = n- 2 P (ny n )Op( Pn /n) = o P {\) 
by the conclusion of Theorem [TJ □ 

Lemma 4 Assuming Regularity Conditions (A) - (G), we have for each $\beta_n \in \Omega_n$, 

$$n^{-1} \|\nabla^2 \hat Q_n(\beta_n) - \nabla^2 Q_n(\beta_n)\| = O_P\bigl(p_n (h^{p+1} + c_n \log^{1/2}(1/h))\bigr).$$

Proof of Lemma [^} By Taylor's expansion and Lemma [TJ 

-1 d 



n 



d/3, 



-(VQ n (/3J - VQ n (/3J) 



ilk 



n 



d 3 Q n (f3 n ) 



+ 



d(3 nk df3 n dcx T 

t-'n 

9&'i3 n da' p \dQ n (l3 n 



oc/3 n - apj + 



d 2 Qn(P n ) ( d6if3 n doCR 



df3 n doL T p. V 8(3, 

t-'n 



Oft 



ilk 



Opnk dp nk ) dOLp n 



j ) d 2 QM 
da.p n d(3 nk 



;i + p(i)) 
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Hence, using Regularity Condition (C), 



n 



nk 



■(vg n (/3j-vg n (/3j) 



d/3 nk 



nk 



= 0(1) • (sup \\ap n (Ui) -ap n (Ui)\\ +sup 
+ sup W&'p^Ui) - a' p (Ui)\\ + sup 

i i 

= P (V^(^ +1 + c n log 1/2 (l// i ))), 
where the last line follows from Lemma HJ Hence 

n- l \\V 2 Q n ((3 n ) -V 2 Q n ((3 n )\\ = P ( Pn (h? +1 + c n \og l l\l/h))). □ 



Lemma 5 Under Regularity Conditions (A) - (G) and $p_n^4/n = o(1)$, 

$$\|n^{-1} \nabla^2 Q_n(\beta_{n0}) + I_n(\beta_{n0})\| = o_P(p_n^{-1}),$$

$$\|n^{-1} \nabla^2 \hat Q_n(\beta_{n0}) + I_n(\beta_{n0})\| = o_P(p_n^{-1}) + O_P\bigl(p_n (h^{p+1} + c_n \log^{1/2}(1/h))\bigr).$$



Proof of Lemma [2 The first conclusion follows from 
E p 2 Jn^V 2 Q n ((3 n0 ) + I n (f3 n0 )\\ 2 



2 -2 



— £, " 



^ I $ ftnid f3 n j 0/3 n iO/3 n j 
From this, triangle inequality immediately gives 



Otfjn) = o(l). 



Wn-'V'QniP^) + I n (f3 n0 )\\ = opip- 1 ) + Hn-^CO^o) - Q n (f3 n0 ))\\- 
The second equation then follows from Lemma HI □ 

Lemma 6 Assuming the conditions in Theorem 3 and under the null hypothesis $H_0$ 
in the theorem, 

$$B_n^T (\hat\gamma_n - \gamma_{n0}) = \frac{1}{n} B_n^T \{B_n I_n(\beta_{n0}) B_n^T\}^{-1} B_n \nabla Q_n(\beta_{n0}) + o_P(n^{-1/2}).$$

Proof of Lemma 6. Since $B_n B_n^T = I_{p_n - l}$, for each $v \in \mathbb{R}^{p_n - l}$, we have 

$$\|B_n^T v\| \le \|v\|. \tag{5.19}$$

Following the proof of Theorem 1, we have $\|B_n^T (\hat\gamma_n - \gamma_{n0})\| = O_P(\sqrt{p_n/n})$. Following 
the proof of Theorem 2 and by Lemma 2, 

$$I_n(\beta_{n0}) B_n^T (\hat\gamma_n - \gamma_{n0}) = n^{-1} \nabla Q_n(\beta_{n0}) + o_P(n^{-1/2}).$$

Left-multiplying by $B_n$ and using equation (5.19), the right hand side of the above 
equation becomes $n^{-1} B_n \nabla Q_n(\beta_{n0}) + o_P(n^{-1/2})$. Hence, 

$$B_n^T (\hat\gamma_n - \gamma_{n0}) = n^{-1} B_n^T (B_n I_n(\beta_{n0}) B_n^T)^{-1} B_n \nabla Q_n(\beta_{n0}) + o_P(n^{-1/2}),$$

since $B_n I_n(\beta_{n0}) B_n^T$ has eigenvalues uniformly bounded away from 0 and infinity, as 
$I_n(\beta_{n0})$ does. □ 